vene.ro · Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)
Learning with Sparse Latent Structure
Vlad Niculae, Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/deep-spin/lp-sparsemap · @vnfrombucharest · https://vene.ro


Page 1:

Learning with Sparse Latent Structure
Vlad Niculae

Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro

Pages 2-7:

Rich Underlying Structure

[Document figure labeled: title, author, date, body]

segmentation: sentences, words, and so on

entities

relationships, e.g., dependency

Most of this structure is hidden.

Pages 8-9:

Rich Underlying Structure: a widely occurring pattern!

speech (Andre-Obrecht, 1988)

objects (Long et al., 2015)

transition graphs (Kipf, Pol, et al., 2020)

But we'll focus on NLP.

Pages 10-11:

Structured Prediction

Part-of-speech taggings of "dog on wheels": VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, · · ·

Dependency trees over "⋆ dog on wheels" (several candidates), · · ·

Word alignments between "dog on wheels" and "hond op wielen" (several candidates), · · ·

Page 12:

Structured Prediction · · ·

Pages 13-14:

Traditional Pipeline Approach

input → pretrained parser → output (positive / neutral / negative)

Pages 15-16:

Deep Learning & Hidden Representations

input → dense vector → output (positive / neutral / negative)

Page 17:

Latent Structure Models

input → latent structure (· · ·) → output (positive / neutral / negative)

Page 18:

*record scratch* *freeze frame*

How to select an item from a set?

Page 19:

How to select an item from a set?

· · ·

Pages 20-25:

How to select an item from a set?

Candidates c1, c2, · · ·, cN with scores θ = [2, 4, −1, 1, −3] and selection p = [0, 1, 0, 0, 0]

input x → θ = f(x; w) → p → output y = g(p, x; w)

∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
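The pipeline on this slide (θ = f(x; w), a selection p, then y = g(p, x; w)) can be sketched numerically. The scores below are the ones shown on the slide; the hard-argmax selection function is a toy illustration of the selection step only, not of any particular model, and it previews the ∂p/∂θ question:

```python
import numpy as np

def hard_select(theta):
    """Map scores theta to a one-hot selection p (the argmax layer)."""
    p = np.zeros_like(theta)
    p[np.argmax(theta)] = 1.0
    return p

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])   # scores from the slide
p = hard_select(theta)
print(p)                                        # [0. 1. 0. 0. 0.]

# A small perturbation of theta leaves p unchanged almost everywhere,
# so the Jacobian dp/dtheta is zero wherever it exists.
p_eps = hard_select(theta + 1e-3 * np.random.randn(5))
print(np.allclose(p, p_eps))                    # True
```

The second print is the crux: the one-hot output is locally constant in θ, so gradient-based training gets no signal through a hard selection.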

Pages 26-32:

Argmax

Candidates c1, c2, · · ·, cN; the scores θ map to a one-hot selection p

∂p/∂θ = ?

Page 33:

Argmax

∂p/∂θ = 0

[Plot: p1 as a function of θ1 is a step from 0 to 1 at θ1 = θ2; axis ticks at θ2 − 1, θ2, θ2 + 1]

Page 34:

Argmax vs. Softmax

pj = exp(θj)/Z

∂p/∂θ = diag(p) − pp⊤

[Plot: p1 as a smooth function of θ1, rising from 0 to 1 around θ1 = θ2]
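The softmax Jacobian on this slide, diag(p) − pp⊤, is easy to verify numerically. A minimal check against central finite differences, with made-up scores θ:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # shift by the max for numerical stability
    return z / z.sum()

def softmax_jacobian(theta):
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)   # diag(p) - p p^T, as on the slide

theta = np.array([0.2, 1.4, -0.5])
J = softmax_jacobian(theta)

# Central finite differences, one input coordinate at a time.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_num[:, j] = (softmax(theta + e) - softmax(theta - e)) / (2 * eps)

print(np.max(np.abs(J - J_num)) < 1e-6)   # True
```

Unlike the argmax Jacobian, this matrix is nonzero everywhere, which is exactly what makes softmax usable as a differentiable selection layer.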

Pages 35-43:

A Softmax Origin Story

The probability simplex: △ = {p ∈ R^N : p ≥ 0, 1⊤p = 1}

N = 2: vertices p = [1, 0] and p = [0, 1]; midpoint p = [1/2, 1/2]

N = 3: vertices include p = [0, 1, 0] and p = [0, 0, 1]; center p = [1/3, 1/3, 1/3]

Pages 44-50:

A Softmax Origin Story

max_j θ_j = max_{p ∈ △} p⊤θ   (Fundamental Thm. Lin. Prog.; Dantzig et al., 1955)

N = 2: θ = [.2, 1.4] gives p⋆ = [0, 1]

N = 3: θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1]
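The identity max_j θ_j = max_{p ∈ △} p⊤θ can be checked directly on the slide's N = 3 example: the linear objective over the simplex is maximized at a vertex (a one-hot vector), and no interior point does better. A small numeric sketch:

```python
import numpy as np

theta = np.array([0.7, 0.1, 1.5])      # the N = 3 scores from the slide

# The simplex has one-hot vertices e_1, ..., e_N; a linear program over it
# attains its maximum at a vertex (Fundamental Thm. of Linear Programming).
vertices = np.eye(3)
p_star = vertices[np.argmax(vertices @ theta)]
print(p_star)                          # [0. 0. 1.] -- matches the slide

assert np.isclose(p_star @ theta, theta.max())

# No interior point beats the best vertex, e.g. the uniform distribution:
p_unif = np.full(3, 1 / 3)
print(p_unif @ theta <= theta.max())   # True
```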

Pages 51-55:

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax: Ω(p) = 0 (no smoothing)

softmax: Ω(p) = Σ_j p_j log p_j

sparsemax: Ω(p) = 1/2 ∥p∥₂²   (Martins and Astudillo, 2016)

α-entmax: Ω(p) = 1/(α(α−1)) Σ_j p_j^α   (Tsallis, 1988; a generalized entropy, Grünwald and Dawid, 2004; Blondel, Martins, and Niculae, 2019a; Peters, Niculae, and Martins, 2019; Correia, Niculae, and Martins, 2019)

[Plot: p1 as a function of θ1 for the operators; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7]]

(Niculae and Blondel, 2017)
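Of the operators above, sparsemax has a simple closed form: the Euclidean projection of θ onto the simplex, computable by the standard sort-and-threshold routine (Martins and Astudillo, 2016). A minimal sketch; the input scores are chosen so the output reproduces the slide's example [.3, 0, .7]:

```python
import numpy as np

def sparsemax(theta):
    """argmax_{p in simplex} p @ theta - 0.5 * ||p||^2,
    i.e. the Euclidean projection of theta onto the simplex."""
    z = np.sort(theta)[::-1]               # scores in decreasing order
    css = np.cumsum(z)
    k = np.arange(1, len(theta) + 1)
    support = z * k > css - 1              # coordinates kept in the support
    k_z = k[support][-1]
    tau = (css[support][-1] - 1) / k_z     # threshold so the output sums to 1
    return np.clip(theta - tau, 0.0, None)

theta = np.array([1.1, 0.1, 1.5])
p = sparsemax(theta)
print(p.round(2))           # [0.3 0.  0.7] -- sparse: one coordinate exactly zero
print(round(p.sum(), 6))    # 1.0
```

Note the contrast with softmax, which would assign every candidate strictly positive probability; sparsemax zeroes out low-scoring candidates exactly.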

Page 56: softmax

Page 57: sparsemax

Page 58: fusedmax ?!

Pages 59-62:

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax: Ω(p) = 0 (no smoothing)

softmax: Ω(p) = Σ_j p_j log p_j

sparsemax: Ω(p) = 1/2 ∥p∥₂²

α-entmax: Ω(p) = 1/(α(α−1)) Σ_j p_j^α

fusedmax: Ω(p) = 1/2 ∥p∥₂² + Σ_j |p_j − p_{j−1}|

[Plot: p1 as a function of θ1; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7]]

(Niculae and Blondel, 2017)

Page 63:

Structured Prediction, finally

Pages 64-66:

Structured Prediction is essentially a (very high-dimensional) argmax

input x → θ → p (over structures c1, c2, · · ·, cN) → output y

There are exponentially many structures (θ cannot fit in memory!)
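The "exponentially many structures" claim is easy to make concrete. As a hypothetical example (the sizes are made up for illustration), sequence labeling with K tags over n tokens has K^n candidate structures:

```python
# Sequence labeling with K tags over n tokens has K**n candidate structures,
# so a per-structure score vector theta can never be materialized.
# K and n are hypothetical sizes chosen only for illustration.
K, n = 20, 40
num_structures = K ** n
print(num_structures)                  # about 1.1e52 structures

bytes_needed = num_structures * 4      # one float32 score per structure
print(bytes_needed > 10 ** 50)         # True: far beyond any memory budget
```

This is why the next slides score parts (arcs, labels, alignments) rather than whole structures.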

Pages 67-72:

Factorization Into Parts: θ = A⊤η

Dependency example over "⋆ dog on wheels". Rows of A index arcs; each column is the arc-indicator vector of one candidate tree; η holds one score per arc.

A =
⋆→dog        1 0 0
on→dog       0 1 1
wheels→dog   0 0 0
⋆→on         0 1 1
dog→on       1 · · · 0 0 · · ·
wheels→on    0 0 0
⋆→wheels     0 0 0
dog→wheels   0 1 0
on→wheels    1 0 1

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

For the highlighted TREE, a_y = [010 100 001] (blocks of three: arcs into dog, into on, into wheels, from heads ⋆, dog, on, wheels).

The same factorization covers alignments between "dog on wheels" and "hond op wielen": rows of A are the word pairs dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen, with analogous indicator columns and η.
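The factorization θ = A⊤η can be reproduced with the three tree columns shown on the slide. The column values below are my reconstruction of the slide's matrix (the elided extra columns are dropped), so treat the exact numbers as illustrative:

```python
import numpy as np

# Columns of A are arc-indicator vectors of three candidate trees for
# "* dog on wheels"; rows in the slide's order (*->dog, on->dog, wheels->dog,
# *->on, dog->on, wheels->on, *->wheels, dog->wheels, on->wheels).
A = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 0, 0],
              [0, 0, 0],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])  # one score per arc

theta = A.T @ eta      # structure scores from part scores: theta = A^T eta
print(theta)           # [0.8 0.7 0.4]
```

Each structure's score is just the sum of its arcs' scores, so η (9 numbers here) scores every candidate tree without ever enumerating θ explicitly in the exponential case.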

Page 73: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 74: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 75: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 76: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 77: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 78: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

argmax      argmax_{p∈△} p⊤θ

softmax     argmax_{p∈△} p⊤θ + H(p)

sparsemax   argmax_{p∈△} p⊤θ − 1/2∥p∥²

MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → Chu‑Liu/Edmonds; matching → Kuhn‑Munkres

marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labeling → forward‑backward (Rabiner, 1989);
            dependency parsing → the Matrix‑Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007);
            matchings → #P‑complete! (Taskar, 2004; Valiant, 1979)
            As attention: (Kim et al., 2017; Liu and Lapata, 2018)

SparseMAP   argmax_{μ∈M} μ⊤η − 1/2∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}

24

Algorithms for SparseMAP

μ⋆ = argmax_{μ∈M} μ⊤η − 1/2∥μ∥²
(a quadratic objective with linear constraints; alas, exponentially many!)

Conditional Gradient (Frank and Wolfe, 1956; Lacoste‑Julien and Jaggi, 2015)

• select a new corner of M: a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^{(t−1)}), i.e. MAP with adjusted scores η̃ := η − μ^{(t−1)}
• update the (sparse) coefficients of p
• Update rules: vanilla, away‑step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Martins, Figueiredo, et al., 2015)
  Active Set achieves finite & linear convergence!

Backward pass

• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP.

25
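The conditional-gradient loop on this slide can be sketched with a brute-force MAP oracle. This is a minimal illustration, not the deck's implementation: `sparsemap_cg`, the enumerated structure set, and the exact line search are illustrative assumptions (real factors use specialized MAP solvers such as Chu-Liu/Edmonds, and the Active Set variant converges faster).

```python
def sparsemap_cg(structures, eta, iters=100):
    """Vanilla conditional gradient for
        mu* = argmax_{mu in conv(structures)} eta^T mu - 1/2 ||mu||^2.
    The only structure-specific ingredient is the MAP oracle (a linear
    maximization over the corners), exactly as on the slide.
    `structures` is an explicit list of corner vectors a_h (toy setting).
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def map_oracle(scores):
        # MAP = best-scoring structure under `scores` (brute force here).
        return max(structures, key=lambda a: dot(a, scores))

    mu = [float(x) for x in map_oracle(eta)]      # start at the MAP corner
    for _ in range(iters):
        grad = [e - m for e, m in zip(eta, mu)]   # gradient: eta - mu
        corner = map_oracle(grad)                 # select a new corner of M
        d = [a - m for a, m in zip(corner, mu)]   # search direction
        gap = dot(d, grad)                        # Frank-Wolfe gap
        if gap <= 1e-12:                          # converged
            break
        gamma = min(1.0, gap / dot(d, d))         # exact line search (quadratic)
        mu = [m + gamma * di for m, di in zip(mu, d)]
    return mu
```

With two corners e1 and e2 and tied scores, the solution is their midpoint, recovering the intuition that SparseMAP hedges over few structures.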

SparseMAP Applications

• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)
• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)
• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie, 2018; Blondel, Martins, and Niculae, 2019b)

26

Latent Dependency Trees

ListOps (Nangia and Bowman, 2018)

Arity tagging with latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)

(max 2 9 (min 4 7 ) 0 )
 4  -  -  2  -  -  -  -  -

27

Page 101: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

0 20 40 60 80 100

0.2

0.4

0.6

0.8

1

epoch

validationF1score Gold tree

Left‐to‐right

28

Page 102: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

0 20 40 60 80 100

0.2

0.4

0.6

0.8

1

epoch

validationF1score Gold tree

Latent treeLeft‐to‐right

28

What if MAP is not available?

Multiple, Overlapping Factors

Maximization in factor graphs: NP‑hard, even when each factor is tractable.

[Factor graph for "⋆ dog on wheels": arc variables *→, dog→, on→, wheels→; one TREE factor plus a BUDGET factor per head word.]

29

Optimization as Consensus-Seeking

[Figure: shared variables μ_[1:3], μ_[4:6], μ_[7:9]; factors f_a and f_b keep local copies μ_a, μ_b.]

Agreement on overlap: μ_{a,[4:6]} = μ_{b,[4:6]} = μ_{[4:6]}

LP relaxation (Wainwright and Jordan, 2008):

max_{μ,μ_f} Σ_{f∈F} η_f⊤ μ_f   s.t.  C_f μ = μ_f,  μ_f ∈ M_f  for f ∈ F

over the local polytope: L := {μ : C_f μ ∈ M_f, f ∈ F} ⊇ M

LP‑SparseMAP (Niculae and Martins, 2020):

max_{μ,μ_f} Σ_{f∈F} η_f⊤ μ_f − 1/2∥μ∥²   s.t.  C_f μ = μ_f,  μ_f ∈ M_f  for f ∈ F

30

Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass

argmax_{C_f μ = μ_f} Σ_{f∈F} η_f⊤ μ_f − 1/2∥μ∥²  =  argmax_{C_f μ = μ_f} Σ_{f∈F} (η_f⊤ μ_f − 1/2∥D_f μ_f∥²)

• Separable objective, agreement constraints → ADMM in consensus form
• SparseMAP subproblem for each f

Backward pass

• Sparse fixed‑point iteration
• Combines the SparseMAP Jacobians of each factor

31
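The consensus-form ADMM pattern above can be sketched on a toy separable problem. Everything here is an illustrative stand-in, not the LP-SparseMAP solver: `consensus_admm`, the quadratic local losses, and ρ = 1 are assumptions, with each closed-form local update playing the role of a per-factor SparseMAP subproblem.

```python
def consensus_admm(thetas, rho=1.0, iters=200):
    """Consensus ADMM for
        min_x  sum_i 1/2 (x_i - theta_i)^2   s.t.  x_i = z for all i.
    Same shape as the forward pass above: separable local subproblems,
    an averaging (consensus) step, and dual updates on the agreement
    constraints.  The optimum is z = mean(thetas).
    """
    n = len(thetas)
    z, u = 0.0, [0.0] * n                 # consensus variable, scaled duals
    for _ in range(iters):
        # Local subproblem per "factor": closed-form quadratic prox,
        # standing in for a SparseMAP subproblem.
        x = [(t + rho * (z - ui)) / (1.0 + rho) for t, ui in zip(thetas, u)]
        z = sum(xi + ui for xi, ui in zip(x, u)) / n   # consensus step
        u = [ui + xi - z for ui, xi in zip(u, x)]      # dual update
    return z
```

The point of the design: only the local update depends on the factor type, so swapping in harder factors leaves the consensus and dual steps unchanged.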

Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

Factor graphs as a hidden‑layer DSL!

If |F| = 1, recovers SparseMAP.

Modular library. Built‑in specialized factors:
• OR, XOR, AND
• OR‑with‑output
• Budget, Knapsack
• Pairwise

New factors only require MAP.

fg = FactorGraph()
var = [fg.variable() for i != j]  # handwave: one arc variable per pair i ≠ j

fg.add(Tree(var))

for i in range(n):
    fg.add(Budget(var[i, :], budget=5))

μ = fg.lp_sparsemap(η)

class Factor:
    def map(ηf):  # abstract, private
        raise NotImplementedError

    def sparsemap(ηf):
        ...  # active set algo, uses self.map

    def backward(dμf):
        ...  # analytic, uses active set result

class Budget(Factor):
    def sparsemap(ηf):
        ...  # specialized

    def backward(dμf):
        ...  # specialized

class Tree(Factor):
    def map(η):
        ...  # Chu-Liu/Edmonds algo

32

[Plot: validation F1 score vs. epoch (0–100) on ListOps. Curves: Gold tree, Latent w/ Budget(5), Latent tree, Left‑to‑right.]

33

Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

input (P, H) → [attention between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely)] → output: entails / contradicts / neutral

(Model: decomposable attention (Parikh et al., 2016). Proposed model: global structured alignment.)

34

Structured Alignment Models

matching: SparseMAP w/ Kuhn‑Munkres (Kuhn, 1955)

LP‑matching: LP‑SparseMAP w/ XORs (equivalent; different solver)

LP‑sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)

35
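To make the matching factor concrete, here is a sketch of its MAP oracle by exhaustive search; `matching_map` is a hypothetical name, and Kuhn-Munkres (as on the slide) replaces the O(n!) enumeration at realistic sizes.

```python
from itertools import permutations

def matching_map(scores):
    """MAP for a one-to-one matching factor by brute force: find the
    permutation (alignment of n premise to n hypothesis positions)
    maximizing the summed alignment scores.  Exact but O(n!);
    Kuhn-Munkres solves the same problem in O(n^3).
    `scores[i][j]` is the score of aligning position i to position j.
    """
    n = len(scores)
    best, best_val = None, float("-inf")
    for perm in permutations(range(n)):
        val = sum(scores[i][perm[i]] for i in range(n))
        if val > best_val:
            best, best_val = perm, val
    return best, best_val
```

Because SparseMAP only queries such an oracle, plugging this (or a Hungarian-algorithm solver) into the conditional-gradient loop yields the sparse matching attention used here.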

Page 131: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

MultiNLI (Williams et al., 2017)

softmax matching LP‐matching LP‐sequence66%

68%

70%

72%

36

Page 132: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

a

gentleman

overlooking

a

neighborhood

situation

.

a

police

officer

watches a

situation

closely . a

police

officer

watches a

situation

closely .

37

Page 133: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

a

police

officer

watches a

situation

closely .

a

gentleman

overlooking

a

neighborhood

situation

.

A

gentleman

overlooking

a

neighborhood

situation

.

A

police

officer

watches

a

situation

closely

.

38

Conclusions

Differentiable & sparse structured inference

Generic, extensible, efficient algorithms

Interpretable structured attention

Future work

Structure beyond NLP

Weak & semi‑supervision

Generative latent structure models

[email protected]  github.com/deep-spin/lp-sparsemap  https://vene.ro  @vnfrombucharest


Extra slides

Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.

Some icons by Dave Gandy and Freepik via flaticon.com.

40

Page 138: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥² = argmin_{p ∈ △} ∥p − θ∥²

Computation:
p⋆ = [θ − τ1]₊;  θᵢ > θⱼ ⇒ p⋆ᵢ ≥ p⋆ⱼ;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass (argmin differentiation; Gould et al., 2016; Amos and Kolter, 2017):
J_sparsemax = diag(s) − (1/|S|) ss⊤,  where S = {j : p⋆ⱼ > 0},  sⱼ = ⟦j ∈ S⟧
(Martins and Astudillo, 2016)
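The forward and backward passes above can be sketched in a few lines of NumPy (a sort-based O(d log d) sketch, not the O(d) partial-sort algorithms cited; the function names are ours, not from any particular library):

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex."""
    z = np.sort(theta)[::-1]          # sort scores in descending order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0        # coordinates that stay in the support
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def sparsemax_jacobian(p):
    """J = diag(s) - s s^T / |S|, with s the 0/1 support indicator of p."""
    s = (p > 0).astype(float)
    return np.diag(s) - np.outer(s, s) / s.sum()
```

The Jacobian only depends on the support S, which is what makes the backward pass cheap once the forward projection is computed.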

Fusedmax

fusedmax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥² − ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|
            = argmin_{p ∈ △} ½∥p − θ∥² + ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|

prox_fused(θ) = argmin_{p ∈ ℝᵈ} ½∥p − θ∥² + ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|

Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))
(Niculae and Blondel, 2017)

"Fused Lasso", a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
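A minimal sketch of the proposition for d = 2, where prox_fused has a simple closed form (for larger d one would use a taut-string or dynamic-programming TV solver); all function names here are ours:

```python
import numpy as np

def prox_fused_2d(theta, lam=1.0):
    """Closed-form fused-lasso prox for d = 2:
    argmin_p 1/2 ||p - theta||^2 + lam * |p[1] - p[0]|."""
    d = theta[0] - theta[1]
    if abs(d) <= 2 * lam:             # the two coordinates fuse to their mean
        return np.full(2, theta.mean())
    shift = lam * np.sign(d)          # otherwise each moves lam toward the other
    return np.array([theta[0] - shift, theta[1] + shift])

def sparsemax(theta):
    """Projection onto the simplex (sort-based)."""
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def fusedmax_2d(theta, lam=1.0):
    # the proposition: fusedmax = sparsemax o prox_fused
    return sparsemax(prox_fused_2d(theta, lam))
```

When the two scores are close, the prox fuses them, so fusedmax outputs equal attention weights; when they are far apart, the TV term only shrinks them toward each other.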

Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : ℝᵈ × Z → ℝ, with Z ⊂ ℝᵈ compact. Then

∂ max_{z ∈ Z} φ(x, z) = conv{∇ₓφ(x, z⋆) | z⋆ ∈ argmax_{z ∈ Z} φ(x, z)}.

Example: maximum of a vector

∂ max_{j ∈ [d]} θⱼ = ∂ max_{p ∈ △} p⊤θ = ∂ max_{p ∈ △} φ(p, θ) = conv{∇_θ φ(p⋆, θ)} = conv{p⋆}

[Figure: for θ = [t, 0] with t ∈ [−1, +1], plots of max_j θⱼ and of the first coordinate g₁ of its subgradients g ∈ ∂ max_j θⱼ.]
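The vector-max example can be checked numerically: away from ties, the argmax is unique, the subdifferential is the single one-hot vertex p⋆, and it matches finite differences (a sketch with invented values):

```python
import numpy as np

def danskin_grad_max(theta):
    """When the argmax is unique, Danskin's theorem says the gradient of
    max_j theta_j is the one-hot indicator of the argmax (a simplex vertex p*)."""
    g = np.zeros_like(theta)
    g[np.argmax(theta)] = 1.0
    return g

# Finite-difference check at a point with a unique maximum.
theta = np.array([0.3, 1.7, -0.2])
eps = 1e-6
fd = np.array([
    (np.max(theta + eps * e) - np.max(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])
```

At a tie, the finite differences would land anywhere in the convex hull of the tied vertices, which is exactly the set conv{p⋆} in the theorem.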

Dynamically inferring the computation graph

So far: a structured hidden layer, E_H[a_H]

The network must handle "soft" combinations of structures. Fine for attention, but can be limiting.

Dependency TreeLSTM

[Figure: a TreeLSTM composing "The bears eat the pretty ones" along dependency arcs.]

(Tai et al., 2015)

45

Latent Dependency TreeLSTM

p(y | x) = ∑_{h ∈ H} p(y | h, x) p(h | x)

[Figure: input x = "The bears eat the pretty ones" → latent tree h ∈ H → output y.]

(Niculae, Martins, and Cardie, 2018)

46

Structured Latent Variable Models

p(y | x) = ∑_{h ∈ H} p_φ(y | h, x) p_π(h | x)

e.g., p_φ: a TreeLSTM defined by h;  p_π: a parsing model, using some score_π(h; x).
The sum over all possible trees is exponentially large, and so is ∑_{h ∈ H} ∂p(y | x)/∂π.

How to define p_π?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0  (argmax)
idea 2: p_π(h | x) ∝ exp(score_π(h; x))  (softmax)
idea 3: SparseMAP
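A toy sketch of the three ideas on an enumerable H of three structures (all scores and probabilities are invented; sparsemax over enumerated scores stands in for SparseMAP, which achieves the same kind of sparsity without enumerating H):

```python
import numpy as np

# Toy setup: |H| = 3 latent structures, scores from the "parsing" model pi,
# and p_phi(y | h, x) for a fixed y under each structure (numbers invented).
scores = np.array([2.0, 1.5, -3.0])     # score_pi(h; x) for each h
p_phi = np.array([0.9, 0.6, 0.2])       # p_phi(y | h, x) for each h

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sparsemax(s):
    z = np.sort(s)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(s) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(s - tau, 0.0)

# idea 1: argmax -- all mass on the single best structure
p_argmax = np.eye(len(scores))[np.argmax(scores)]
# idea 2: softmax -- dense: every h in H contributes to the sum
p_softmax = softmax(scores)
# idea 3: sparse posterior -- only a few structures survive
p_sparse = sparsemax(scores)

# p(y | x) = sum_h p_phi(y | h, x) * p_pi(h | x); with a sparse posterior
# the exponential sum collapses to the (small) support.
for p in (p_argmax, p_softmax, p_sparse):
    print("p(y|x) =", (p_phi * p).sum(), "support size:", int((p > 0).sum()))
```

The point of the sparse choice is visible in the support sizes: the forward (and backward) sums only run over the structures with nonzero posterior, while staying differentiable, unlike argmax.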

SparseMAP

[Figure: the SparseMAP distribution over parse trees assigns weight .7 to one tree, .3 to a second, and exactly 0 to every other tree.]

p(y | x) = .7 p_φ(y | h₁, x) + .3 p_φ(y | h₂, x)

48

[Figure: bar charts comparing the models below.
Sentiment classification (SST): accuracy (binary), axis 80%–85%, bars for LTR, Flat, CoreNLP, Latent.
Natural Language Inference (SNLI): accuracy (3-class), axis 80.6%–82%, bars for LTR, Flat, CoreNLP, Latent.
Reverse dictionary lookup, on definitions and on concepts: accuracy@10, axis 30%–38%, bars for LTR, Flat, Latent.]

Baselines (⋆ marks the root over "The bears eat the pretty ones"):
Left-to-right (LTR): regular LSTM.
Flat: bag-of-words-like.
CoreNLP: off-line parser.

Sentence pair classification (P, H):
p(y | P, H) = ∑_{h_P ∈ H(P)} ∑_{h_H ∈ H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)

Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016);
instead of p(y | x), we model E_{p_π}[g(x)] = ∑_{h ∈ H} g(x; h) p_π(h | x).

Syntax vs. Composition Order

[Figure: latent trees over "⋆ lovely and poignant ." — the highest-probability tree (p = 22.6%) and the CoreNLP parse (p = 21.4%), among others.]

50

Syntax vs. Composition Order

[Figure: latent trees over "⋆ a deep and meaningful film ." — the two highest-probability trees (p = 15.33% and p = 15.27%, with arc marginals of 1.0), among others; the CoreNLP parse receives p = 0%.]

51

Structured Output Prediction

SparseMAP loss:
L_A(η, μ̄) = max_{μ ∈ M} (η⊤μ − ½∥μ∥²) − η⊤μ̄ + ½∥μ̄∥²

cost-SparseMAP loss:
L_A^ρ(η, μ̄) = max_{μ ∈ M} (η⊤μ − ½∥μ∥² + ρ(μ, μ̄)) − η⊤μ̄ + ½∥μ̄∥²

Instances of structured Fenchel-Young losses, like the CRF and structured SVM losses (Blondel, Martins, and Niculae, 2019b).
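A sketch of the SparseMAP loss specialized to M = △ (the simplex case, where the inner maximizer is exactly sparsemax); by Danskin's theorem the gradient in η is μ⋆ − μ̄. Function names are ours:

```python
import numpy as np

def sparsemax(theta):
    """Projection onto the simplex (sort-based)."""
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def sparsemap_loss(eta, mu_true):
    """L_A(eta, mu_true) = max_{mu in M} (eta.mu - 1/2 ||mu||^2)
                           - eta.mu_true + 1/2 ||mu_true||^2,
    with M the simplex, so the inner maximizer is sparsemax(eta)."""
    mu_star = sparsemax(eta)
    value = (eta @ mu_star - 0.5 * mu_star @ mu_star
             - eta @ mu_true + 0.5 * mu_true @ mu_true)
    grad = mu_star - mu_true   # Danskin: gradient of the loss w.r.t. eta
    return value, grad
```

As with any Fenchel-Young loss, the value is nonnegative and reaches zero exactly when the prediction μ⋆ = sparsemax(η) matches the target μ̄.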


References I

Amos, Brandon and J. Zico Kolter (2017). “OptNet: Differentiable optimization as a layer in neural networks”. In: Proc. of ICML.

Andre-Obrecht, Regine (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.

Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.

Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). “Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.

— (2019b). “Learning with Fenchel-Young Losses”. In: preprint arXiv:1901.02324.

Brucker, Peter (1984). “An O(n) algorithm for quadratic knapsack problems”. In: Operations Research Letters 3.3, pp. 163–166.

Condat, Laurent (2016). “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming 158.1-2, pp. 575–585.

Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). “Adaptively Sparse Transformers”. In: Proc. of EMNLP.

Corro, Caio and Ivan Titov (2019). “Learning latent trees with stochastic perturbations and differentiable dynamic programming”. In: Proc. of ACL.

Danskin, John M (1966). “The theory of max-min, with applications”. In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

53

References II

Dantzig, George B, Alex Orden, and Philip Wolfe (1955). “The generalized simplex method for minimizing a linear form under linear inequality restraints”. In: Pacific Journal of Mathematics 5.2, pp. 183–195.

Frank, Marguerite and Philip Wolfe (1956). “An algorithm for quadratic programming”. In: Nav. Res. Log. 3.1-2, pp. 95–110.

Gould, Stephen et al. (2016). “On differentiating parameterized argmin and argmax problems with application to bi-level optimization”. In: preprint arXiv:1607.05447.

Grünwald, Peter D and A Philip Dawid (2004). “Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory”. In: Annals of Statistics, pp. 1367–1433.

Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). “Validation of subgradient optimization”. In: Mathematical Programming 6.1, pp. 62–88.

Hill, Felix et al. (2016). “Learning to understand phrases by embedding the dictionary”. In: TACL 4.1, pp. 17–30.

Kim, Yoon et al. (2017). “Structured attention networks”. In: Proc. of ICLR.

Kipf, Thomas, Elise van der Pol, and Max Welling (2020). “Contrastive Learning of Structured World Models”. In: Proc. of ICLR.

Kipf, Thomas and Max Welling (2017). “Semi-supervised classification with graph convolutional networks”. In: Proc. of ICLR.

54

References IIIKoo, Terry et al. (2007). “Structured prediction models via the matrix‐tree theorem”. In: Proc. of EMNLP.

Kuhn, Harold W (1955). “The Hungarian method for the assignment problem”. In: Nav. Res. Log. 2.1‐2, pp. 83–97.

Lacoste‐Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank‐Wolfe optimization variants”. In: Proc.of NeurIPS.

Liu, Yang and Mirella Lapata (2018). “Learning structured text representations”. In: TACL 6, pp. 63–75.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional networks for semantic segmentation”. In: Proc. of CVPR.

Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.

Martins, André FT, Mário AT Figueiredo, et al. (2015). “AD3: Alternating directions dual decomposition for MAP inference in graphical models”. In: JMLR 16.1, pp. 495–545.

McDonald, Ryan T and Giorgio Satta (2007). “On the complexity of non-projective data-driven dependency parsing”. In: Proc. of IWPT.

Nangia, Nikita and Samuel Bowman (2018). “ListOps: A diagnostic dataset for latent tree learning”. In: Proc. of NAACL SRW.


Page 199:

References IV

Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.

Niculae, Vlad and André FT Martins (2020). “LP-SparseMAP: Differentiable relaxed optimization for sparse structured prediction”. In: preprint arXiv:2001.04437.

Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). “SparseMAP: Differentiable sparse structured inference”. In: Proc. of ICML.

Niculae, Vlad, André FT Martins, and Claire Cardie (2018). “Towards dynamic computation graphs via sparse latent structure”. In: Proc. of EMNLP.

Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.

Parikh, Ankur et al. (2016). “A decomposable attention model for natural language inference”. In: Proc. of EMNLP.

Peters, Ben, Vlad Niculae, and André FT Martins (2019). “Sparse sequence-to-sequence models”. In: Proc. of ACL.

Rabiner, Lawrence R. (1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. In: P. IEEE 77.2, pp. 257–286.

Smith, David A and Noah A Smith (2007). “Probabilistic models of nonprojective dependency trees”. In: Proc. of EMNLP.


Page 200:

References V

Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree-structured Long Short-Term Memory networks”. In: Proc. of ACL-IJCNLP.

Taskar, Ben (2004). “Learning structured prediction models: A large margin approach”. PhD thesis. Stanford University.

Tibshirani, Robert et al. (2005). “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.

Tsallis, Constantino (1988). “Possible generalization of Boltzmann-Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.

Valiant, Leslie G (1979). “The complexity of computing the permanent”. In: Theor. Comput. Sci. 8.2, pp. 189–201.

Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.

Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). “A broad‐coverage challenge corpus for sentence understandingthrough inference”. In: preprint arXiv:1704.05426.

Wolfe, Philip (1976). “Finding the nearest point in a polytope”. In: Mathematical Programming 11.1, pp. 128–149.
