CPSC 540: Machine Learning
Structure Learning, Structured SVMs
Mark Schmidt
University of British Columbia
Winter 2017
Admin
Assignment 4:
Due today, 1 late day for Wednesday, 2 for the following Monday.
No office hours tomorrow.
Project proposals: “no news is good news”.
Assignment 5 and project/final descriptions are coming soon.
Last Time: Structured Prediction
We discussed structured prediction:
Supervised learning where output y is a general object.
For example, automatic brain tumour segmentation:
We want to label all pixels and model dependencies between pixels.
We could formulate this as a density estimation problem of modeling p(x, y).
Here y is the labeling of the entire image.
But features x may be complicated.
CRFs generalize logistic regression and directly model p(y|x).
Outline
Ising Models
The Ising model for binary x_i is defined by
p(x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d x_i w_i + \sum_{(i,j) \in E} x_i x_j w_{ij} \right).
Consider using x_i ∈ {−1, 1}:
If w_i > 0 it encourages x_i = 1.
If w_{ij} > 0 it encourages neighbours i and j to have the same value.
E.g., neighbouring pixels in the image receive the same label (“attractive” model).
This model is a special case of a pairwise UGM with
\phi_i(x_i) = \exp(x_i w_i), \qquad \phi_{ij}(x_i, x_j) = \exp(x_i x_j w_{ij}).
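To make the factorization concrete, here is a minimal numpy sketch (not from the lecture) that evaluates an Ising model's probability on a tiny chain; the parameter values are made up for illustration and Z is computed by brute-force enumeration, which is only feasible for very small d.

```python
import itertools
import numpy as np

# Toy Ising model: 4 binary variables x_i in {-1, +1} on a chain.
# The parameter values below are hypothetical, chosen only for illustration.
w = np.array([0.1, -0.2, 0.3, 0.0])                 # node weights w_i
edges = [(0, 1), (1, 2), (2, 3)]                    # edge set E
w_edge = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 1.0}    # edge weights w_ij

def unnormalized(x):
    """exp( sum_i x_i*w_i + sum_{(i,j) in E} x_i*x_j*w_ij )."""
    score = np.dot(w, x) + sum(x[i] * x[j] * w_edge[(i, j)] for (i, j) in edges)
    return np.exp(score)

# Brute-force normalizing constant Z (only feasible for tiny d).
Z = sum(unnormalized(np.array(x)) for x in itertools.product([-1, 1], repeat=4))

x = np.array([1, 1, 1, -1])
print("p(x) =", unnormalized(x) / Z)
```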
General Pairwise UGM
For general discrete x_i a generalization is
p(x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d w_{i,x_i} + \sum_{(i,j) \in E} w_{i,j,x_i,x_j} \right),
which can represent any “positive” pairwise UGM (meaning p(x) > 0 for all x).
Interpretation of weights for this UGM:
If w_{i,1} > w_{i,2} then we prefer x_i = 1 to x_i = 2.
If w_{i,j,1,1} > w_{i,j,2,2} then we prefer (x_i = 1, x_j = 1) to (x_i = 2, x_j = 2).
As before, we can use parameter tying:
We could use the same w_{i,x_i} for all positions i.
The Ising model corresponds to tying of the w_{i,j,x_i,x_j}.
Log-Linear Models
These models are special cases of log-linear models which have the form
p(x \mid w) = \frac{1}{Z} \exp\left( w^T F(x) \right),
for some parameters w and features F(x).
The log-linear NLL is convex and has the form
-\log p(x \mid w) = -w^T F(x) + \log Z,
and the gradient can be written as
-\nabla \log p(x \mid w) = -F(x) + \mathbb{E}[F(x)].
So if the gradient is zero, the empirical features match the expected features.
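As a sanity check on this identity (not part of the slides), a tiny numpy sketch that computes E[F(x)] by brute-force enumeration and forms the gradient -F(x) + E[F(x)] for a toy model with F(x) = x:

```python
import itertools
import numpy as np

# Toy log-linear model over 3 binary variables with features F(x) = x,
# so p(x | w) = exp(w^T x) / Z.  Everything here is brute force.
np.random.seed(0)
w = np.random.randn(3)
configs = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

logits = configs @ w                                      # w^T F(x) for every configuration x
probs = np.exp(logits - np.log(np.exp(logits).sum()))     # p(x | w)

x_obs = np.array([1.0, -1.0, 1.0])                        # one "observed" example
expected_F = probs @ configs                              # E[F(x)] under the model
grad_nll = -x_obs + expected_F                            # -grad log p(x_obs | w)
print(grad_nll)
```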
Training Log-Linear Models
The term E[F (x)] in the gradient may be hard to compute.
In a pairwise UGM, it depends on univariate and pairwise marginals.
It’s common to use variational or Monte Carlo estimates of these marginals.
In RBMs, we alternate between block Gibbs sampling and stochastic gradient.
Or a crude approximation is pseudo-likelihood,
p(x_1, x_2, \dots, x_d) \approx \prod_{j=1}^d p(x_j \mid x_{-j}),
which turns learning into d single-variable problems (similar to DAGs).
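A minimal sketch of the pseudo-likelihood idea for the Ising case (my own illustration, assuming x_i ∈ {−1, +1} and an edge-weight dictionary keyed by sorted vertex pairs); each conditional p(x_j | x_−j) is logistic in the local field:

```python
import numpy as np

def ising_neg_log_pseudolikelihood(w_node, w_edge, X, edges):
    """Negative log-pseudo-likelihood for an Ising model with x_i in {-1, +1}.

    X: (n, d) array of samples.  w_edge: dict mapping each edge (i, j), i < j,
    to its weight w_ij.  Each conditional is a logistic model:
    p(x_j | x_-j) = sigmoid(2 * x_j * (w_j + sum_k w_jk x_k)).
    """
    n, d = X.shape
    neighbours = {j: [] for j in range(d)}
    for (i, j) in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    nll = 0.0
    for x in X:
        for j in range(d):
            field = w_node[j] + sum(
                w_edge[(min(j, k), max(j, k))] * x[k] for k in neighbours[j])
            nll -= np.log(1.0 / (1.0 + np.exp(-2.0 * x[j] * field)))
    return nll   # minimize this (e.g., with gradient methods) instead of the exact NLL
```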
Pairwise UGM on MNIST Digits
Samples from a lattice-structured UGM:
Training: 100k stochastic gradient w/ Gibbs sampling steps with α_t = 0.01.
Samples are iteration 100k of Gibbs sampling with fixed w.
Structure Learning in UGMs
The problem of choosing the graph is called structure learning.
Generalizes feature selection: we want to find all relationships between variables.
Finding the optimal tree is a minimum spanning tree problem.
“Chow-Liu algorithm”: based on pairwise mutual information (see the sketch after this list).
NP-hard for non-tree DAGs and UGMs.
For DAGs, we usually do a greedy search through space of acyclic graphs.
For Ising UGMs, we can use L1-regularization of the w_{ij} values.
If w_{ij} = 0, then we remove the dependency.
For discrete UGMs, we can use group L1-regularization of the w_{i,j,x_i,x_j} values.
If w_{i,j,x_i,x_j} = 0 for all x_i and x_j, we remove the dependency.
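As a rough illustration of the tree case (the sketch mentioned above), Chow-Liu reduces to a maximum-weight spanning tree on pairwise mutual information; this toy numpy version assumes binary 0/1 data and uses a simple Prim's algorithm, so it is only meant for modest d:

```python
import numpy as np

def chow_liu_tree(X):
    """Chow-Liu tree for binary data X of shape (n, d).

    Computes pairwise mutual information, then a maximum-weight spanning
    tree with a simple Prim's algorithm (fine for modest d).
    """
    n, d = X.shape
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((X[:, i] == a) & (X[:, j] == b)) + 1e-12
                    p_a = np.mean(X[:, i] == a) + 1e-12
                    p_b = np.mean(X[:, j] == b) + 1e-12
                    mi[i, j] += p_ab * np.log(p_ab / (p_a * p_b))
            mi[j, i] = mi[i, j]

    # Prim's algorithm for the maximum-weight spanning tree.
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        best = None
        for i in in_tree:
            for j in range(d):
                if j not in in_tree and (best is None or mi[i, j] > mi[best[0], best[1]]):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges
```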
Structure Learning on Rain Data
Large λ (and optimal tree): (graph over the 28 rain variables; figure omitted)
Small λ: (graph over the 28 rain variables; figure omitted)
Structure Learning on USPS Digits
Structure learning of pairwise UGM with group-L1 on USPS digits:
(graph over the 16×16 pixel grid; figure omitted)
Structure Learning on USPS Digits
Optimal tree on USPS digits:
(tree over the 16×16 pixel grid; figure omitted)
20 Newsgroups Data
Data containing presence of 100 words from newsgroups posts:
car drive files hockey mac league pc win
0 0 1 0 1 0 1 0
0 0 0 1 0 1 0 1
1 1 0 0 0 0 0 0
0 1 1 0 1 0 0 0
0 0 1 0 0 0 1 1
Structure learning should give relationship between words.
Structure Learning on News Words
Optimal tree on news words:
(tree over the 100 newsgroup words; figure omitted)
Structure Learning on News Words
Group-L1 on news words:
(graph over the 100 newsgroup words; figure omitted)
Outline
Rain Data without Month Information
Consider an Ising model for the rain data with tied parameters,
p(y_1, y_2, \dots, y_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{j=2}^d y_j y_{j-1} v \right).
First term reflects that “not rain” is more likely.
Second term reflects that consecutive days are more likely to be the same.
But how can we model that “some months are less rainy”?
Rain Data with Month Information: Boltzmann Machine
We could add 12 binary latent variables z_j,
p(y_1, y_2, \dots, y_d, z) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i z_j v_2 + \sum_{j=1}^{12} z_j w_2 \right),
which is a variation on a Boltzmann machine.
Modifies the probability of “rain” for each of the 12 values.
Inference is more expensive due to the extra variables.
Rain Data with Month Information: MRF
If we know the months we could add an explicit month feature x_j,
p(y_1, y_2, \dots, y_d, x) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i x_j v_2 + \sum_{j=1}^{12} x_j w_2 \right),
Learning might be easier: we’re given known clusters.
But we still have to model the distribution of x.
It’s easy in this case because months are uniform.
But in other cases we may want to use a complicated x.
And inference is more expensive than in chain-structured models.
Rain Data with Month Information: CRF
In conditional random fields we fit distribution conditioned on x,
p(y_1, y_2, \dots, y_d \mid x) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i x_j v_2 \right).
Now we don’t need to model x.
Just need to figure out how x affects y.
The conditional UGM given x has a chain structure,
\phi_i(y_i) = \exp\left( y_i w + \sum_{j=1}^{12} y_i x_j v_2 \right), \qquad \phi_{ij}(y_i, y_j) = \exp(y_i y_j v),
so inference can be done using forward-backward.
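For concreteness, a small numpy sketch of forward-backward on a chain UGM (my own illustration, with unnormalized messages, so it is only numerically safe for short chains); the node and edge potential arrays are assumed inputs:

```python
import numpy as np

def chain_logZ_and_marginals(node_pot, edge_pot):
    """Forward-backward for a chain UGM.

    node_pot: (d, k) array, node_pot[i, s] = phi_i(s).
    edge_pot: (k, k) array, shared phi_ij(s, t) for consecutive nodes.
    Returns log Z and the (d, k) univariate marginals.
    """
    d, k = node_pot.shape
    alpha = np.zeros((d, k))   # forward messages (unnormalized)
    beta = np.zeros((d, k))    # backward messages
    alpha[0] = node_pot[0]
    for i in range(1, d):
        alpha[i] = node_pot[i] * (alpha[i - 1] @ edge_pot)
    beta[-1] = 1.0
    for i in range(d - 2, -1, -1):
        beta[i] = edge_pot @ (node_pot[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()
    marginals = alpha * beta / Z
    return np.log(Z), marginals
```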
Rain Data with Month Information
Samples from CRF conditioned on x for December and July:
Code available as part of UGM package.
Brain Tumour Segmentation with Label Dependencies
We could label all voxels i as “tumour” or not using logistic regression,
p(y_i \mid x_i) = \frac{\exp(y_i w^T x_i)}{\exp(w^T x_i) + \exp(-w^T x_i)}.
But this misses dependence in labels yi:
We prefer neighbouring voxels to have the same value.
Brain Tumour Segmentation with Label Dependencies
With independent logistic, joint distribution over all voxels is
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \prod_{i=1}^d \frac{\exp(y_i w^T x_i)}{\exp(w^T x_i) + \exp(-w^T x_i)} \propto \exp\left( \sum_{i=1}^d y_i w^T x_i \right),
which is a UGM with no edges,
\phi_i(y_i) = \exp(y_i w^T x_i),
so given the xi there is no dependence between the yi.
Brain Tumour Segmentation with Label Dependencies
Adding an Ising-like term to model dependencies between yi gives
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w^T x_i + \sum_{(i,j) \in E} y_i y_j v \right),
Now we have the same “good” logistic regression model, but v controls how strongly we want neighbours to be the same.
Note that we’re going to jointly learn w and v.
Brain Tumour Segmentation with Label Dependencies
We got a bit more fancy and used edge features x_{ij},
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w^T x_i + \sum_{(i,j) \in E} y_i y_j v^T x_{ij} \right).
For example, we could use x_{ij} = 1/(1 + |x_i − x_j|).
Encourages y_i and y_j to be more similar if x_i and x_j are more similar.
This is a pairwise UGM with
\phi_i(y_i) = \exp(y_i w^T x_i), \qquad \phi_{ij}(y_i, y_j) = \exp(y_i y_j v^T x_{ij}).
Conditional Log-Linear Models
All these CRFs can be written as conditional log-linear models,
p(y \mid x, w) = \frac{1}{Z(x)} \exp(w^T F(x, y)),
for some parameters w and features F (x, y).
The NLL is convex and has the form
-\log p(y \mid x, w) = -w^T F(x, y) + \log Z(x),
and the gradient can be written as
-\nabla \log p(y \mid x, w) = -F(x, y) + \mathbb{E}_{y \mid x}[F(x, y)].
Unlike before, we now have a Z(x) and marginals for each x.
Trained using gradient methods like quasi-Newton, SG, or SAG.
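A hedged sketch of a single stochastic-gradient step on the regularized conditional NLL; the helpers `features(x, y)` and `expected_features(x, w)` (the inference routine that returns E_{y|x}[F(x, y)], e.g. via forward-backward for a chain) are assumed, not defined here:

```python
import numpy as np

def sgd_step(w, x, y, features, expected_features, step_size=0.01, lam=1e-3):
    """One stochastic gradient step on the L2-regularized conditional NLL.

    grad = -F(x, y) + E_{y|x}[F(x, y)] + lam * w
    `features(x, y)` and `expected_features(x, w)` are assumed to return
    vectors of the same length as w.
    """
    grad = -features(x, y) + expected_features(x, w) + lam * w
    return w - step_size * grad
```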
Modeling OCR Dependencies
What dependencies should we model for this problem?
φ(y_i, x_i): potential of individual letter given image.
φ(y_{i−1}, y_i): dependency between adjacent letters (‘q-u’).
φ(y_{i−1}, y_i, x_{i−1}, x_i): adjacent letters and image dependency.
φ_i(y_{i−1}, y_i): inhomogeneous dependency (French: ‘e-r’ ending).
φ_i(y_{i−2}, y_{i−1}, y_i): third-order and inhomogeneous (English: ‘i-n-g’ end).
φ(y ∈ D): is y in dictionary D?
Tractability of Discriminative Models
If the yi graph is a tree, we can easily fit CRFs.
But there are other cases where we can fit conditional log-linear models.
“Dictionary” feature is non-Markov, but exact computation still easy.
We can use pseudo-likelihood or approximate inference.
Some other cases where exact computation is possible:
Semi-Markov chains (allow dependence on time you spend in a state).
Context-free grammars (allows potentials on recursively-nested parts of sequence).
Sum-product networks (restrict potentials to allow exact computation).
Outline
Learning for Structured Prediction
3 types of classifiers discussed in CPSC 340/540:
Model                             “Classic ML”           Structured Prediction
Generative model p(y, x)          Naive Bayes, GDA       UGM (or MRF)
Discriminative model p(y|x)       Logistic regression    CRF
Discriminant function y = f(x)    SVM                    Structured SVM
Discriminative models don’t need to model x.
Discriminant functions don’t worry about probabilities.
Based on decoding, which is different than inference in structured case.
SVMs and Likelihood Ratios
Logistic regression optimizes a likelihood of the form
p(y_i \mid x_i, w) \propto \exp(y_i w^T x_i).
But if we only want correct decisions it’s sufficient that
\frac{p(y_i \mid x_i, w)}{p(-y_i \mid x_i, w)} \geq \kappa,
for any κ > 1.
Taking logarithms and plugging in the probabilities gives
(y_i w^T x_i - \log Z) - (-y_i w^T x_i - \log Z) \geq \log \kappa, \quad \text{i.e.,} \quad 2 y_i w^T x_i \geq \log \kappa.
Since κ is arbitrary, let’s use log(κ) = 2,
y_i w^T x_i \geq 1.
SVMs and Likelihood Ratios
So to classify all i correctly it’s sufficient that
y_i w^T x_i \geq 1,
but this linear program may have no solutions.
To guarantee a solution, allow a non-negative “slack” r_i and penalize the size of r_i,
\operatorname*{argmin}_{w,r} \; \sum_{i=1}^n r_i \quad \text{with } y_i w^T x_i \geq 1 - r_i \text{ and } r_i \geq 0.
If we apply our Day 2 linear programming trick in reverse this minimizes
f(w) = \sum_{i=1}^n [1 - y_i w^T x_i]_+
and adding an L2-regularizer gives the standard SVM objective.
The notation [α]_+ means max{0, α}.
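A short numpy sketch (illustration only) of this hinge-loss objective with the L2-regularizer, returning the value and one valid subgradient:

```python
import numpy as np

def svm_objective_and_subgrad(w, X, y, lam):
    """Hinge-loss SVM: sum_i [1 - y_i w^T x_i]_+ + (lam/2) ||w||^2.

    X is (n, p), y is (n,) with entries in {-1, +1}.
    """
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    f = hinge.sum() + 0.5 * lam * (w @ w)
    active = hinge > 0                                  # examples violating the margin
    subgrad = -(X[active] * y[active, None]).sum(axis=0) + lam * w
    return f, subgrad
```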
Multi-Class SVMs: nk-Slack Formulation
With multi-class logistic regression we use
p(y_i = c \mid x_i, w) \propto \exp(w_c^T x_i).
If we want correct decisions it’s sufficient for all y′ ≠ y_i that
\frac{p(y_i \mid x_i, w)}{p(y' \mid x_i, w)} \geq \kappa.
Following the same steps as before, this corresponds to
w_{y_i}^T x_i - w_{y'}^T x_i \geq 1.
Adding slack variables our linear programming trick gives
f(W) = \sum_{i=1}^n \sum_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+,
which with L2-regularization we’ll call the nk-slack multi-class SVM.
Multi-Class SVMs: n-Slack Formulation
If we want correct decisions it’s also sufficient that
\frac{p(y_i \mid x_i, w)}{\max_{y' \neq y_i} p(y' \mid x_i, w)} \geq \kappa.
This leads to the constraints
w_{y_i}^T x_i - \max_{y' \neq y_i} w_{y'}^T x_i \geq 1.
Following the same steps gives an alternate objective
f(W) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+,
which with L2-regularization we’ll call the n-slack multi-class SVM.
Multi-Class SVMs: nk-Slack vs. n-Slack
Our two formulations of multi-class SVMs:
f(W) = \sum_{i=1}^n \sum_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+ + \frac{\lambda}{2}\|W\|_F^2,

f(W) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+ + \frac{\lambda}{2}\|W\|_F^2.
The nk-slack loss penalizes based on all y′ that could be confused with yi.
The n-slack loss only penalizes based on the “most confusing” alternate example.
While nk-slack often works better, n-slack can be used for structured prediction...
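To make the comparison concrete, a small numpy sketch (illustration only) that evaluates both the nk-slack and n-slack objectives on the same data:

```python
import numpy as np

def multiclass_svm_losses(W, X, y, lam):
    """nk-slack and n-slack multi-class SVM objectives.

    W is (k, p), X is (n, p), y is (n,) with labels in {0, ..., k-1}.
    """
    scores = X @ W.T                               # (n, k) values w_c^T x_i
    n = X.shape[0]
    correct = scores[np.arange(n), y]              # w_{y_i}^T x_i
    margins = 1.0 - correct[:, None] + scores      # 1 - w_{y_i}^T x_i + w_{y'}^T x_i
    margins[np.arange(n), y] = 0.0                 # exclude y' = y_i
    hinges = np.maximum(0.0, margins)
    reg = 0.5 * lam * np.sum(W ** 2)
    nk_slack = hinges.sum() + reg                  # sum over all y' != y_i
    n_slack = hinges.max(axis=1).sum() + reg       # only the "most confusing" y'
    return nk_slack, n_slack
```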
Hidden Markov Support Vector Machines
For decoding in conditional random fields to get the entire labeling correct we need
\frac{p(y_i \mid x_i, w)}{p(y' \mid x_i, w)} \geq \gamma,
for all alternative configurations y′.
Following the same steps as before we obtain
f(w) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - \log p(y_i \mid x_i, w) + \log p(y' \mid x_i, w)]_+ + \frac{\lambda}{2}\|w\|^2,
the hidden Markov support vector machine (HMSVM).
Tries to make log-probability of true yi greater than for other y′ by more than 1.
Hidden Markov Support Vector Machines
Two problems with the HMSVM:
1. It requires finding the second-best decoding, which is harder than decoding.
2. It views any alternative labeling y′ as equally bad.
Suppose that y_i = [1 1 1 1], and predictions of two models are
y′ = [1 1 0 1],    y′ = [0 0 0 0],
should both models receive the same loss on this example?
Adding a Loss Function
We can fix both HMSVM issues by replacing the “correct decision” constraint,
\log p(y_i \mid x_i, w) - \log p(y' \mid x_i, w) \geq 1,
with a constraint containing a loss function g,
\log p(y_i \mid x_i, w) - \log p(y' \mid x_i, w) \geq g(y_i, y').
Usually we take g(yi, y′) to be the difference between yi and y′.
If g(y_i, y_i) = 0, you can maximize over all y′ instead of y′ ≠ y_i.
Further, if g is written as a sum of functions depending on the graph edges, finding the “most violated” constraint is equivalent to decoding.
Structured SVMs
These constraints lead to the max-margin Markov network objective,
f(w) = \sum_{i=1}^n \max_{y'} [g(y_i, y') - \log p(y_i \mid x_i, w) + \log p(y' \mid x_i, w)]_+ + \frac{\lambda}{2}\|w\|^2,
which is also known as a structured SVM.
Beyond the learning principle, key differences between CRFs and SSVMs:
SSVMs require decoding, not inference, for learning:
Exact SSVMs in cases like graph cuts, matchings, rankings, etc.
SSVMs have a loss function for complicated accuracy measures:
But the loss needs to decompose over parts for tractability.
Could also formulate ‘loss-augmented’ CRFs.
We can also train with approximate decoding methods.
State of the art training: block-coordinate Frank-Wolfe (bonus slides).
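As a sketch of the decoding requirement, here is a hypothetical loss-augmented Viterbi routine for a chain with Hamming loss g (the node and edge score arrays are assumed inputs); it finds the "most violated" labeling needed inside the max:

```python
import numpy as np

def loss_augmented_viterbi(node_score, edge_score, y_true):
    """argmax over y' of the chain score plus the Hamming loss to y_true.

    node_score: (d, k) array of per-position label scores.
    edge_score: (k, k) array shared across consecutive positions.
    Because the Hamming loss decomposes over nodes, it is just added to
    the node scores before running standard Viterbi.
    """
    d, k = node_score.shape
    aug = node_score + (np.arange(k)[None, :] != np.array(y_true)[:, None])  # add Hamming loss
    dp = np.zeros((d, k))
    back = np.zeros((d, k), dtype=int)
    dp[0] = aug[0]
    for i in range(1, d):
        cand = dp[i - 1][:, None] + edge_score          # (k_prev, k_cur)
        back[i] = cand.argmax(axis=0)
        dp[i] = cand.max(axis=0) + aug[i]
    y = np.zeros(d, dtype=int)
    y[-1] = dp[-1].argmax()
    for i in range(d - 1, 0, -1):
        y[i - 1] = back[i, y[i]]
    return y
```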
Summary
Log-linear models are the most common UGM when learning parameters.
Structure learning is the problem of learning the graph structure.
Hard in general, but L1-regularization gives a fast heuristic.
Conditional log-linear models are the most common CRF models.
But you can fit some non-Markov models too.
Structured SVMs are a generalization of SVMs to structured prediction.
Only require decoding instead of inference.
Next time: convolutional neural networks.
Bonus Slide: SVMs for Ranking with Pairwise Preference
Suppose we want to rank examples.
A common setting is with features x_i and pairwise preferences:
List of objects (i, j) where we want y_i > y_j.
Assuming a log-linear model,
p(y_i \mid x_i, w) \propto \exp(w^T x_i),
we can derive a loss function based on the pairwise preference decisions,
\frac{p(y_i \mid x_i, w)}{p(y_j \mid x_j, w)} \geq \gamma,
which gives a loss function of the form
f(w) = \sum_{(i,j) \in R} [1 - w^T x_i + w^T x_j]_+.
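A minimal sketch (illustration only) of this pairwise-preference loss, where `prefs` holds the pairs (i, j) in R:

```python
import numpy as np

def pairwise_ranking_loss(w, X, prefs):
    """Hinge loss over preferred pairs (i, j): sum of [1 - w^T x_i + w^T x_j]_+.

    prefs is a list of index pairs (i, j) meaning example i should rank above j.
    """
    scores = X @ w
    return sum(max(0.0, 1.0 - scores[i] + scores[j]) for (i, j) in prefs)
```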
Bonus Slide: Fitting Structured SVMs
Overview of progress on training SSVMs:
Cutting plane and bundle methods (e.g., svmStruct software):
Require O(1/ε) iterations.
Each iteration requires decoding on every training example.
Stochastic sub-gradient methods:
Each iteration requires decoding on a single training example.
Still requires O(1/ε) iterations.
Need to choose step size.
Dual Online exponentiated gradient (OEG):
Allows line-search for step size and has O(1/ε) rate.
Each iteration requires inference on a single training example.
Dual block-coordinate Frank-Wolfe (BCFW):
Each iteration requires decoding on a single training example.
Requires O(1/ε) iterations.
Closed-form optimal step size.
Theory allows approximate decoding.
Bonus Slide: Block-Coordinate Frank-Wolfe
Key ideas behind BCFW for SSVMs:
The dual problem has the form
\min_{\alpha_i \in \mathcal{M}_i} F(\alpha) = f(A\alpha) - \sum_i f_i(\alpha_i),
where f is smooth.
Problem structure where we can use block coordinate descent:
Normal coordinate updates are intractable because α_i has |Y| elements.
But the Frank-Wolfe block-coordinate update is equivalent to decoding:
s = \operatorname*{argmin}_{s' \in \mathcal{M}_i} \; F(\alpha) + \langle \nabla_i F(\alpha), s' - \alpha_i \rangle,
\alpha_i = \alpha_i + \gamma(s - \alpha_i).
Can implement algorithm in terms of primal variables.
Connections between Frank-Wolfe and other algorithms:
Frank-Wolfe on dual problem is subgradient step on primal.
‘Fully corrective’ Frank-Wolfe is equivalent to cutting plane.