vene.ro · Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)
Learning with Sparse Latent Structure
Vlad Niculae, Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/deep-spin/lp-sparsemap · @vnfrombucharest · https://vene.ro


Page 1:

Learning with Sparse Latent Structure
Vlad Niculae

Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro

Pages 2-7:

Rich Underlying Structure

[Document figure labeled: title, author, date, body]

segmentation: sentences, words, and so on

entities

relationships, e.g., dependency

Most of this structure is hidden.

Pages 8-9:

Rich Underlying Structure: a widely occurring pattern!

speech (Andre-Obrecht, 1988)

objects (Long et al., 2015)

transition graphs (Kipf, Pol, et al., 2020)

But we'll focus on NLP.

Pages 10-11:

Structured Prediction

Part-of-speech taggings of "dog on wheels": VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, · · ·

Dependency trees over "⋆ dog on wheels" (several candidates), · · ·

Word alignments between "dog on wheels" and "hond op wielen" (several candidates), · · ·

Page 12:

Structured Prediction · · ·

Pages 13-14:

Traditional Pipeline Approach

input → pretrained parser → output (positive / neutral / negative)

Pages 15-16:

Deep Learning & Hidden Representations

input → dense vector → output (positive / neutral / negative)

Page 17:

Latent Structure Models

input → latent structure (· · ·) → output (positive / neutral / negative)

Page 18:

*record scratch* *freeze frame*

How to select an item from a set?

Page 19:

How to select an item from a set?

· · ·

Pages 20-25:

How to select an item from a set?

Candidates c1, c2, · · ·, cN with scores θ = [2, 4, −1, 1, −3] and selection p = [0, 1, 0, 0, 0]

input x → θ = f(x; w) → p → output y = g(p, x; w)

∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
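The pipeline on this slide (θ = f(x; w), a selection p, then y = g(p, x; w)) can be sketched numerically. The scores below are the ones shown on the slide; the hard-argmax selection function is a toy illustration of the selection step only, not of any particular model, and it previews the ∂p/∂θ question:

```python
import numpy as np

def hard_select(theta):
    """Map scores theta to a one-hot selection p (the argmax layer)."""
    p = np.zeros_like(theta)
    p[np.argmax(theta)] = 1.0
    return p

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])   # scores from the slide
p = hard_select(theta)
print(p)                                        # [0. 1. 0. 0. 0.]

# A small perturbation of theta leaves p unchanged almost everywhere,
# so the Jacobian dp/dtheta is zero wherever it exists.
p_eps = hard_select(theta + 1e-3 * np.random.randn(5))
print(np.allclose(p, p_eps))                    # True
```

The second print is the crux: the one-hot output is locally constant in θ, so gradient-based training gets no signal through a hard selection.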

Pages 26-32:

Argmax

Candidates c1, c2, · · ·, cN; the scores θ map to a one-hot selection p

∂p/∂θ = ?

Page 33:

Argmax

∂p/∂θ = 0

[Plot: p1 as a function of θ1 is a step from 0 to 1 at θ1 = θ2; axis ticks at θ2 − 1, θ2, θ2 + 1]

Page 34:

Argmax vs. Softmax

pj = exp(θj)/Z

∂p/∂θ = diag(p) − pp⊤

[Plot: p1 as a smooth function of θ1, rising from 0 to 1 around θ1 = θ2]
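The softmax Jacobian on this slide, diag(p) − pp⊤, is easy to verify numerically. A minimal check against central finite differences, with made-up scores θ:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # shift by the max for numerical stability
    return z / z.sum()

def softmax_jacobian(theta):
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)   # diag(p) - p p^T, as on the slide

theta = np.array([0.2, 1.4, -0.5])
J = softmax_jacobian(theta)

# Central finite differences, one input coordinate at a time.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_num[:, j] = (softmax(theta + e) - softmax(theta - e)) / (2 * eps)

print(np.max(np.abs(J - J_num)) < 1e-6)   # True
```

Unlike the argmax Jacobian, this matrix is nonzero everywhere, which is exactly what makes softmax usable as a differentiable selection layer.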

Pages 35-43:

A Softmax Origin Story

The probability simplex: △ = {p ∈ R^N : p ≥ 0, 1⊤p = 1}

N = 2: vertices p = [1, 0] and p = [0, 1]; midpoint p = [1/2, 1/2]

N = 3: vertices include p = [0, 1, 0] and p = [0, 0, 1]; center p = [1/3, 1/3, 1/3]

Pages 44-50:

A Softmax Origin Story

max_j θ_j = max_{p ∈ △} p⊤θ   (Fundamental Thm. Lin. Prog.; Dantzig et al., 1955)

N = 2: θ = [.2, 1.4] gives p⋆ = [0, 1]

N = 3: θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1]
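The identity max_j θ_j = max_{p ∈ △} p⊤θ can be checked directly on the slide's N = 3 example: the linear objective over the simplex is maximized at a vertex (a one-hot vector), and no interior point does better. A small numeric sketch:

```python
import numpy as np

theta = np.array([0.7, 0.1, 1.5])      # the N = 3 scores from the slide

# The simplex has one-hot vertices e_1, ..., e_N; a linear program over it
# attains its maximum at a vertex (Fundamental Thm. of Linear Programming).
vertices = np.eye(3)
p_star = vertices[np.argmax(vertices @ theta)]
print(p_star)                          # [0. 0. 1.] -- matches the slide

assert np.isclose(p_star @ theta, theta.max())

# No interior point beats the best vertex, e.g. the uniform distribution:
p_unif = np.full(3, 1 / 3)
print(p_unif @ theta <= theta.max())   # True
```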

Pages 51-55:

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax: Ω(p) = 0 (no smoothing)

softmax: Ω(p) = Σ_j p_j log p_j

sparsemax: Ω(p) = 1/2 ∥p∥₂²   (Martins and Astudillo, 2016)

α-entmax: Ω(p) = 1/(α(α−1)) Σ_j p_j^α   (Tsallis, 1988; a generalized entropy, Grünwald and Dawid, 2004; Blondel, Martins, and Niculae, 2019a; Peters, Niculae, and Martins, 2019; Correia, Niculae, and Martins, 2019)

[Plot: p1 as a function of θ1 for the operators; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7]]

(Niculae and Blondel, 2017)
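Of the operators above, sparsemax has a simple closed form: the Euclidean projection of θ onto the simplex, computable by the standard sort-and-threshold routine (Martins and Astudillo, 2016). A minimal sketch; the input scores are chosen so the output reproduces the slide's example [.3, 0, .7]:

```python
import numpy as np

def sparsemax(theta):
    """argmax_{p in simplex} p @ theta - 0.5 * ||p||^2,
    i.e. the Euclidean projection of theta onto the simplex."""
    z = np.sort(theta)[::-1]               # scores in decreasing order
    css = np.cumsum(z)
    k = np.arange(1, len(theta) + 1)
    support = z * k > css - 1              # coordinates kept in the support
    k_z = k[support][-1]
    tau = (css[support][-1] - 1) / k_z     # threshold so the output sums to 1
    return np.clip(theta - tau, 0.0, None)

theta = np.array([1.1, 0.1, 1.5])
p = sparsemax(theta)
print(p.round(2))           # [0.3 0.  0.7] -- sparse: one coordinate exactly zero
print(round(p.sum(), 6))    # 1.0
```

Note the contrast with softmax, which would assign every candidate strictly positive probability; sparsemax zeroes out low-scoring candidates exactly.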

Page 56: softmax

Page 57: sparsemax

Page 58: fusedmax ?!

Pages 59-62:

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax: Ω(p) = 0 (no smoothing)

softmax: Ω(p) = Σ_j p_j log p_j

sparsemax: Ω(p) = 1/2 ∥p∥₂²

α-entmax: Ω(p) = 1/(α(α−1)) Σ_j p_j^α

fusedmax: Ω(p) = 1/2 ∥p∥₂² + Σ_j |p_j − p_{j−1}|

[Plot: p1 as a function of θ1; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7]]

(Niculae and Blondel, 2017)

Page 63:

Structured Prediction, finally

Pages 64-66:

Structured Prediction is essentially a (very high-dimensional) argmax

input x → θ → p (over structures c1, c2, · · ·, cN) → output y

There are exponentially many structures (θ cannot fit in memory!)
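The "exponentially many structures" claim is easy to make concrete. As a hypothetical example (the sizes are made up for illustration), sequence labeling with K tags over n tokens has K^n candidate structures:

```python
# Sequence labeling with K tags over n tokens has K**n candidate structures,
# so a per-structure score vector theta can never be materialized.
# K and n are hypothetical sizes chosen only for illustration.
K, n = 20, 40
num_structures = K ** n
print(num_structures)                  # about 1.1e52 structures

bytes_needed = num_structures * 4      # one float32 score per structure
print(bytes_needed > 10 ** 50)         # True: far beyond any memory budget
```

This is why the next slides score parts (arcs, labels, alignments) rather than whole structures.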

Pages 67-72:

Factorization Into Parts: θ = A⊤η

Dependency example over "⋆ dog on wheels". Rows of A index arcs; each column is the arc-indicator vector of one candidate tree; η holds one score per arc.

A =
⋆→dog        1 0 0
on→dog       0 1 1
wheels→dog   0 0 0
⋆→on         0 1 1
dog→on       1 · · · 0 0 · · ·
wheels→on    0 0 0
⋆→wheels     0 0 0
dog→wheels   0 1 0
on→wheels    1 0 1

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

For the highlighted TREE, a_y = [010 100 001] (blocks of three: arcs into dog, into on, into wheels, from heads ⋆, dog, on, wheels).

The same factorization covers alignments between "dog on wheels" and "hond op wielen": rows of A are the word pairs dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen, with analogous indicator columns and η.
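The factorization θ = A⊤η can be reproduced with the three tree columns shown on the slide. The column values below are my reconstruction of the slide's matrix (the elided extra columns are dropped), so treat the exact numbers as illustrative:

```python
import numpy as np

# Columns of A are arc-indicator vectors of three candidate trees for
# "* dog on wheels"; rows in the slide's order (*->dog, on->dog, wheels->dog,
# *->on, dog->on, wheels->on, *->wheels, dog->wheels, on->wheels).
A = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 0, 0],
              [0, 0, 0],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])  # one score per arc

theta = A.T @ eta      # structure scores from part scores: theta = A^T eta
print(theta)           # [0.8 0.7 0.4]
```

Each structure's score is just the sum of its arcs' scores, so η (9 numbers here) scores every candidate tree without ever enumerating θ explicitly in the exponential case.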

Page 73: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 74: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 75: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 76: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 77: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

Page 78: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

argmax argmaxp∈

p⊤θ

softmax argmaxp∈

p⊤θ +H(p)

sparsemax argmaxp∈

p⊤θ − 1/2∥p∥2

MAP argmaxμ∈M

μ⊤η

marginals argmaxμ∈M

μ⊤η + eH(μ)SparseMAP argmax

μ∈Mμ⊤η − 1/2∥μ∥2

M

e.g. dependency parsing→ Chu‐Liu/Edmondsmatching→ Kuhn‐Munkres

e.g. sequence labeling→ forward‐backward(Rabiner, 1989)

As attention: (Kim et al., 2017)

e.g. dependency parsing→ the Matrix‐Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007)

As attention: (Liu and Lapata, 2018)

e.g. matchings→ #P‐complete!(Taskar, 2004; Valiant, 1979)

M := convah : h ∈H

=Ap : p ∈

=EH∼p aH : p ∈

24

argmax      argmax_{p∈△} p⊤θ

softmax     argmax_{p∈△} p⊤θ + H(p)

sparsemax   argmax_{p∈△} p⊤θ − 1/2∥p∥²

MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → Chu‑Liu/Edmonds; matching → Kuhn‑Munkres

marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labeling → forward‑backward (Rabiner, 1989);
            dependency parsing → the Matrix‑Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007);
            matchings → #P‑complete! (Taskar, 2004; Valiant, 1979)
            As attention: (Kim et al., 2017; Liu and Lapata, 2018)

SparseMAP   argmax_{μ∈M} μ⊤η − 1/2∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}

24

Algorithms for SparseMAP

μ⋆ = argmax_{μ∈M} μ⊤η − 1/2∥μ∥²
(a quadratic objective with linear constraints; alas, exponentially many!)

Conditional Gradient (Frank and Wolfe, 1956; Lacoste‑Julien and Jaggi, 2015)

• select a new corner of M: a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^{(t−1)}), i.e. MAP with adjusted scores η̃ := η − μ^{(t−1)}
• update the (sparse) coefficients of p
• Update rules: vanilla, away‑step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Martins, Figueiredo, et al., 2015)
  Active Set achieves finite & linear convergence!

Backward pass

• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP.

25
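The conditional-gradient loop on this slide can be sketched with a brute-force MAP oracle. This is a minimal illustration, not the deck's implementation: `sparsemap_cg`, the enumerated structure set, and the exact line search are illustrative assumptions (real factors use specialized MAP solvers such as Chu-Liu/Edmonds, and the Active Set variant converges faster).

```python
def sparsemap_cg(structures, eta, iters=100):
    """Vanilla conditional gradient for
        mu* = argmax_{mu in conv(structures)} eta^T mu - 1/2 ||mu||^2.
    The only structure-specific ingredient is the MAP oracle (a linear
    maximization over the corners), exactly as on the slide.
    `structures` is an explicit list of corner vectors a_h (toy setting).
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def map_oracle(scores):
        # MAP = best-scoring structure under `scores` (brute force here).
        return max(structures, key=lambda a: dot(a, scores))

    mu = [float(x) for x in map_oracle(eta)]      # start at the MAP corner
    for _ in range(iters):
        grad = [e - m for e, m in zip(eta, mu)]   # gradient: eta - mu
        corner = map_oracle(grad)                 # select a new corner of M
        d = [a - m for a, m in zip(corner, mu)]   # search direction
        gap = dot(d, grad)                        # Frank-Wolfe gap
        if gap <= 1e-12:                          # converged
            break
        gamma = min(1.0, gap / dot(d, d))         # exact line search (quadratic)
        mu = [m + gamma * di for m, di in zip(mu, d)]
    return mu
```

With two corners e1 and e2 and tied scores, the solution is their midpoint, recovering the intuition that SparseMAP hedges over few structures.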

SparseMAP Applications

• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)
• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)
• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie, 2018; Blondel, Martins, and Niculae, 2019b)

26

Latent Dependency Trees

ListOps (Nangia and Bowman, 2018)

Arity tagging with latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)

(max 2 9 (min 4 7 ) 0 )
 4  -  -  2  -  -  -  -  -

27

Page 101: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

0 20 40 60 80 100

0.2

0.4

0.6

0.8

1

epoch

validationF1score Gold tree

Left‐to‐right

28

Page 102: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

0 20 40 60 80 100

0.2

0.4

0.6

0.8

1

epoch

validationF1score Gold tree

Latent treeLeft‐to‐right

28

What if MAP is not available?

Multiple, Overlapping Factors

Maximization in factor graphs: NP‑hard, even when each factor is tractable.

[Factor graph for "⋆ dog on wheels": arc variables *→, dog→, on→, wheels→; one TREE factor plus a BUDGET factor per head word.]

29

Optimization as Consensus-Seeking

[Figure: shared variables μ_[1:3], μ_[4:6], μ_[7:9]; factors f_a and f_b keep local copies μ_a, μ_b.]

Agreement on overlap: μ_{a,[4:6]} = μ_{b,[4:6]} = μ_{[4:6]}

LP relaxation (Wainwright and Jordan, 2008):

max_{μ,μ_f} Σ_{f∈F} η_f⊤ μ_f   s.t.  C_f μ = μ_f,  μ_f ∈ M_f  for f ∈ F

over the local polytope: L := {μ : C_f μ ∈ M_f, f ∈ F} ⊇ M

LP‑SparseMAP (Niculae and Martins, 2020):

max_{μ,μ_f} Σ_{f∈F} η_f⊤ μ_f − 1/2∥μ∥²   s.t.  C_f μ = μ_f,  μ_f ∈ M_f  for f ∈ F

30

Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass

argmax_{C_f μ = μ_f} Σ_{f∈F} η_f⊤ μ_f − 1/2∥μ∥²  =  argmax_{C_f μ = μ_f} Σ_{f∈F} (η_f⊤ μ_f − 1/2∥D_f μ_f∥²)

• Separable objective, agreement constraints → ADMM in consensus form
• SparseMAP subproblem for each f

Backward pass

• Sparse fixed‑point iteration
• Combines the SparseMAP Jacobians of each factor

31
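The consensus-form ADMM pattern above can be sketched on a toy separable problem. Everything here is an illustrative stand-in, not the LP-SparseMAP solver: `consensus_admm`, the quadratic local losses, and ρ = 1 are assumptions, with each closed-form local update playing the role of a per-factor SparseMAP subproblem.

```python
def consensus_admm(thetas, rho=1.0, iters=200):
    """Consensus ADMM for
        min_x  sum_i 1/2 (x_i - theta_i)^2   s.t.  x_i = z for all i.
    Same shape as the forward pass above: separable local subproblems,
    an averaging (consensus) step, and dual updates on the agreement
    constraints.  The optimum is z = mean(thetas).
    """
    n = len(thetas)
    z, u = 0.0, [0.0] * n                 # consensus variable, scaled duals
    for _ in range(iters):
        # Local subproblem per "factor": closed-form quadratic prox,
        # standing in for a SparseMAP subproblem.
        x = [(t + rho * (z - ui)) / (1.0 + rho) for t, ui in zip(thetas, u)]
        z = sum(xi + ui for xi, ui in zip(x, u)) / n   # consensus step
        u = [ui + xi - z for ui, xi in zip(u, x)]      # dual update
    return z
```

The point of the design: only the local update depends on the factor type, so swapping in harder factors leaves the consensus and dual steps unchanged.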

Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

Factor graphs as a hidden‑layer DSL!

If |F| = 1, recovers SparseMAP.

Modular library. Built‑in specialized factors:
• OR, XOR, AND
• OR‑with‑output
• Budget, Knapsack
• Pairwise

New factors only require MAP.

fg = FactorGraph()
var = [fg.variable() for i != j]  # handwave: one arc variable per pair i ≠ j

fg.add(Tree(var))

for i in range(n):
    fg.add(Budget(var[i, :], budget=5))

μ = fg.lp_sparsemap(η)

class Factor:
    def map(ηf):  # abstract, private
        raise NotImplementedError

    def sparsemap(ηf):
        ...  # active set algo, uses self.map

    def backward(dμf):
        ...  # analytic, uses active set result

class Budget(Factor):
    def sparsemap(ηf):
        ...  # specialized

    def backward(dμf):
        ...  # specialized

class Tree(Factor):
    def map(η):
        ...  # Chu-Liu/Edmonds algo

32

[Plot: validation F1 score vs. epoch (0–100) on ListOps. Curves: Gold tree, Latent w/ Budget(5), Latent tree, Left‑to‑right.]

33

Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

input (P, H) → [attention between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely)] → output: entails / contradicts / neutral

(Model: decomposable attention (Parikh et al., 2016). Proposed model: global structured alignment.)

34

Structured Alignment Models

matching: SparseMAP w/ Kuhn‑Munkres (Kuhn, 1955)

LP‑matching: LP‑SparseMAP w/ XORs (equivalent; different solver)

LP‑sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)

35
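To make the matching factor concrete, here is a sketch of its MAP oracle by exhaustive search; `matching_map` is a hypothetical name, and Kuhn-Munkres (as on the slide) replaces the O(n!) enumeration at realistic sizes.

```python
from itertools import permutations

def matching_map(scores):
    """MAP for a one-to-one matching factor by brute force: find the
    permutation (alignment of n premise to n hypothesis positions)
    maximizing the summed alignment scores.  Exact but O(n!);
    Kuhn-Munkres solves the same problem in O(n^3).
    `scores[i][j]` is the score of aligning position i to position j.
    """
    n = len(scores)
    best, best_val = None, float("-inf")
    for perm in permutations(range(n)):
        val = sum(scores[i][perm[i]] for i in range(n))
        if val > best_val:
            best, best_val = perm, val
    return best, best_val
```

Because SparseMAP only queries such an oracle, plugging this (or a Hungarian-algorithm solver) into the conditional-gradient loop yields the sparse matching attention used here.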

Page 131: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

MultiNLI (Williams et al., 2017)

softmax matching LP‐matching LP‐sequence66%

68%

70%

72%

36

Page 132: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

a

gentleman

overlooking

a

neighborhood

situation

.

a

police

officer

watches a

situation

closely . a

police

officer

watches a

situation

closely .

37

Page 133: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

a

police

officer

watches a

situation

closely .

a

gentleman

overlooking

a

neighborhood

situation

.

A

gentleman

overlooking

a

neighborhood

situation

.

A

police

officer

watches

a

situation

closely

.

38

Conclusions

Differentiable & sparse structured inference

Generic, extensible, efficient algorithms

Interpretable structured attention

Future work

Structure beyond NLP

Weak & semi‑supervision

Generative latent structure models

[email protected]  github.com/deep-spin/lp-sparsemap  https://vene.ro  @vnfrombucharest


Extra slides

Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.

Some icons by Dave Gandy and Freepik via flaticon.com.

40

Page 138: vene.ro · DifferentiableSparseStructuredPrediction (NiculaeandMartins,2020) TREE BUDGET BUDGET BUDGET BUDGET Factorgraphsasahidden‐layerDSL! If|F |=1,recoversSparseMAP

Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥² = argmin_{p ∈ △} ∥p − θ∥²

Computation:
p⋆ = [θ − τ1]₊;  θᵢ > θⱼ ⇒ p⋆ᵢ ≥ p⋆ⱼ;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass (argmin differentiation; Gould et al., 2016; Amos and Kolter, 2017):
J_sparsemax = diag(s) − (1/|S|) ss⊤,  where S = {j : p⋆ⱼ > 0},  sⱼ = ⟦j ∈ S⟧
(Martins and Astudillo, 2016)
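The forward and backward passes above can be sketched in a few lines of NumPy (a sort-based O(d log d) sketch, not the O(d) partial-sort algorithms cited; the function names are ours, not from any particular library):

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex."""
    z = np.sort(theta)[::-1]          # sort scores in descending order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0        # coordinates that stay in the support
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def sparsemax_jacobian(p):
    """J = diag(s) - s s^T / |S|, with s the 0/1 support indicator of p."""
    s = (p > 0).astype(float)
    return np.diag(s) - np.outer(s, s) / s.sum()
```

The Jacobian only depends on the support S, which is what makes the backward pass cheap once the forward projection is computed.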

Fusedmax

fusedmax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥² − ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|
            = argmin_{p ∈ △} ½∥p − θ∥² + ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|

prox_fused(θ) = argmin_{p ∈ ℝᵈ} ½∥p − θ∥² + ∑_{2≤j≤d} |pⱼ − pⱼ₋₁|

Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))
(Niculae and Blondel, 2017)

"Fused Lasso", a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
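A minimal sketch of the proposition for d = 2, where prox_fused has a simple closed form (for larger d one would use a taut-string or dynamic-programming TV solver); all function names here are ours:

```python
import numpy as np

def prox_fused_2d(theta, lam=1.0):
    """Closed-form fused-lasso prox for d = 2:
    argmin_p 1/2 ||p - theta||^2 + lam * |p[1] - p[0]|."""
    d = theta[0] - theta[1]
    if abs(d) <= 2 * lam:             # the two coordinates fuse to their mean
        return np.full(2, theta.mean())
    shift = lam * np.sign(d)          # otherwise each moves lam toward the other
    return np.array([theta[0] - shift, theta[1] + shift])

def sparsemax(theta):
    """Projection onto the simplex (sort-based)."""
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def fusedmax_2d(theta, lam=1.0):
    # the proposition: fusedmax = sparsemax o prox_fused
    return sparsemax(prox_fused_2d(theta, lam))
```

When the two scores are close, the prox fuses them, so fusedmax outputs equal attention weights; when they are far apart, the TV term only shrinks them toward each other.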

Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : ℝᵈ × Z → ℝ, with Z ⊂ ℝᵈ compact. Then

∂ max_{z ∈ Z} φ(x, z) = conv{∇ₓφ(x, z⋆) | z⋆ ∈ argmax_{z ∈ Z} φ(x, z)}.

Example: maximum of a vector

∂ max_{j ∈ [d]} θⱼ = ∂ max_{p ∈ △} p⊤θ = ∂ max_{p ∈ △} φ(p, θ) = conv{∇_θ φ(p⋆, θ)} = conv{p⋆}

[Figure: for θ = [t, 0] with t ∈ [−1, +1], plots of max_j θⱼ and of the first coordinate g₁ of its subgradients g ∈ ∂ max_j θⱼ.]
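The vector-max example can be checked numerically: away from ties, the argmax is unique, the subdifferential is the single one-hot vertex p⋆, and it matches finite differences (a sketch with invented values):

```python
import numpy as np

def danskin_grad_max(theta):
    """When the argmax is unique, Danskin's theorem says the gradient of
    max_j theta_j is the one-hot indicator of the argmax (a simplex vertex p*)."""
    g = np.zeros_like(theta)
    g[np.argmax(theta)] = 1.0
    return g

# Finite-difference check at a point with a unique maximum.
theta = np.array([0.3, 1.7, -0.2])
eps = 1e-6
fd = np.array([
    (np.max(theta + eps * e) - np.max(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])
```

At a tie, the finite differences would land anywhere in the convex hull of the tied vertices, which is exactly the set conv{p⋆} in the theorem.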

Dynamically inferring the computation graph

So far: a structured hidden layer, E_H[a_H]

The network must handle "soft" combinations of structures. Fine for attention, but can be limiting.

Dependency TreeLSTM

[Figure: a TreeLSTM composing "The bears eat the pretty ones" along dependency arcs.]

(Tai et al., 2015)

45

Latent Dependency TreeLSTM

p(y | x) = ∑_{h ∈ H} p(y | h, x) p(h | x)

[Figure: input x = "The bears eat the pretty ones" → latent tree h ∈ H → output y.]

(Niculae, Martins, and Cardie, 2018)

46

Structured Latent Variable Models

p(y | x) = ∑_{h ∈ H} p_φ(y | h, x) p_π(h | x)

e.g., p_φ: a TreeLSTM defined by h;  p_π: a parsing model, using some score_π(h; x).
The sum over all possible trees is exponentially large, and so is ∑_{h ∈ H} ∂p(y | x)/∂π.

How to define p_π?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0  (argmax)
idea 2: p_π(h | x) ∝ exp(score_π(h; x))  (softmax)
idea 3: SparseMAP
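A toy sketch of the three ideas on an enumerable H of three structures (all scores and probabilities are invented; sparsemax over enumerated scores stands in for SparseMAP, which achieves the same kind of sparsity without enumerating H):

```python
import numpy as np

# Toy setup: |H| = 3 latent structures, scores from the "parsing" model pi,
# and p_phi(y | h, x) for a fixed y under each structure (numbers invented).
scores = np.array([2.0, 1.5, -3.0])     # score_pi(h; x) for each h
p_phi = np.array([0.9, 0.6, 0.2])       # p_phi(y | h, x) for each h

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sparsemax(s):
    z = np.sort(s)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(s) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(s - tau, 0.0)

# idea 1: argmax -- all mass on the single best structure
p_argmax = np.eye(len(scores))[np.argmax(scores)]
# idea 2: softmax -- dense: every h in H contributes to the sum
p_softmax = softmax(scores)
# idea 3: sparse posterior -- only a few structures survive
p_sparse = sparsemax(scores)

# p(y | x) = sum_h p_phi(y | h, x) * p_pi(h | x); with a sparse posterior
# the exponential sum collapses to the (small) support.
for p in (p_argmax, p_softmax, p_sparse):
    print("p(y|x) =", (p_phi * p).sum(), "support size:", int((p > 0).sum()))
```

The point of the sparse choice is visible in the support sizes: the forward (and backward) sums only run over the structures with nonzero posterior, while staying differentiable, unlike argmax.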

SparseMAP

[Figure: the SparseMAP distribution over parse trees assigns weight .7 to one tree, .3 to a second, and exactly 0 to every other tree.]

p(y | x) = .7 p_φ(y | h₁, x) + .3 p_φ(y | h₂, x)

48

[Figure: bar charts comparing the models below.
Sentiment classification (SST): accuracy (binary), axis 80%–85%, bars for LTR, Flat, CoreNLP, Latent.
Natural Language Inference (SNLI): accuracy (3-class), axis 80.6%–82%, bars for LTR, Flat, CoreNLP, Latent.
Reverse dictionary lookup, on definitions and on concepts: accuracy@10, axis 30%–38%, bars for LTR, Flat, Latent.]

Baselines (⋆ marks the root over "The bears eat the pretty ones"):
Left-to-right (LTR): regular LSTM.
Flat: bag-of-words-like.
CoreNLP: off-line parser.

Sentence pair classification (P, H):
p(y | P, H) = ∑_{h_P ∈ H(P)} ∑_{h_H ∈ H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)

Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016);
instead of p(y | x), we model E_{p_π}[g(x)] = ∑_{h ∈ H} g(x; h) p_π(h | x).

Syntax vs. Composition Order

[Figure: latent trees over "⋆ lovely and poignant ." — the highest-probability tree (p = 22.6%) and the CoreNLP parse (p = 21.4%), among others.]

50

Syntax vs. Composition Order

[Figure: latent trees over "⋆ a deep and meaningful film ." — the two highest-probability trees (p = 15.33% and p = 15.27%, with arc marginals of 1.0), among others; the CoreNLP parse receives p = 0%.]

51

Structured Output Prediction

SparseMAP loss:
L_A(η, μ̄) = max_{μ ∈ M} (η⊤μ − ½∥μ∥²) − η⊤μ̄ + ½∥μ̄∥²

cost-SparseMAP loss:
L_A^ρ(η, μ̄) = max_{μ ∈ M} (η⊤μ − ½∥μ∥² + ρ(μ, μ̄)) − η⊤μ̄ + ½∥μ̄∥²

Instances of structured Fenchel-Young losses, like the CRF and structured SVM losses (Blondel, Martins, and Niculae, 2019b).
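A sketch of the SparseMAP loss specialized to M = △ (the simplex case, where the inner maximizer is exactly sparsemax); by Danskin's theorem the gradient in η is μ⋆ − μ̄. Function names are ours:

```python
import numpy as np

def sparsemax(theta):
    """Projection onto the simplex (sort-based)."""
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def sparsemap_loss(eta, mu_true):
    """L_A(eta, mu_true) = max_{mu in M} (eta.mu - 1/2 ||mu||^2)
                           - eta.mu_true + 1/2 ||mu_true||^2,
    with M the simplex, so the inner maximizer is sparsemax(eta)."""
    mu_star = sparsemax(eta)
    value = (eta @ mu_star - 0.5 * mu_star @ mu_star
             - eta @ mu_true + 0.5 * mu_true @ mu_true)
    grad = mu_star - mu_true   # Danskin: gradient of the loss w.r.t. eta
    return value, grad
```

As with any Fenchel-Young loss, the value is nonnegative and reaches zero exactly when the prediction μ⋆ = sparsemax(η) matches the target μ̄.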


References I

Amos, Brandon and J. Zico Kolter (2017). “OptNet: Differentiable optimization as a layer in neural networks”. In: Proc. of ICML.

Andre-Obrecht, Regine (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.

Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.

Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). “Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.

— (2019b). “Learning with Fenchel-Young Losses”. In: preprint arXiv:1901.02324.

Brucker, Peter (1984). “An O(n) algorithm for quadratic knapsack problems”. In: Operations Research Letters 3.3, pp. 163–166.

Condat, Laurent (2016). “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming 158.1-2, pp. 575–585.

Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). “Adaptively Sparse Transformers”. In: Proc. of EMNLP.

Corro, Caio and Ivan Titov (2019). “Learning latent trees with stochastic perturbations and differentiable dynamic programming”. In: Proc. of ACL.

Danskin, John M (1966). “The theory of max-min, with applications”. In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

53

References II

Dantzig, George B, Alex Orden, and Philip Wolfe (1955). “The generalized simplex method for minimizing a linear form under linear inequality restraints”. In: Pacific Journal of Mathematics 5.2, pp. 183–195.

Frank, Marguerite and Philip Wolfe (1956). “An algorithm for quadratic programming”. In: Nav. Res. Log. 3.1-2, pp. 95–110.

Gould, Stephen et al. (2016). “On differentiating parameterized argmin and argmax problems with application to bi-level optimization”. In: preprint arXiv:1607.05447.

Grünwald, Peter D and A Philip Dawid (2004). “Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory”. In: Annals of Statistics, pp. 1367–1433.

Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). “Validation of subgradient optimization”. In: Mathematical Programming 6.1, pp. 62–88.

Hill, Felix et al. (2016). “Learning to understand phrases by embedding the dictionary”. In: TACL 4.1, pp. 17–30.

Kim, Yoon et al. (2017). “Structured attention networks”. In: Proc. of ICLR.

Kipf, Thomas, Elise van der Pol, and Max Welling (2020). “Contrastive Learning of Structured World Models”. In: Proc. of ICLR.

Kipf, Thomas and Max Welling (2017). “Semi-supervised classification with graph convolutional networks”. In: Proc. of ICLR.

54

References IIIKoo, Terry et al. (2007). “Structured prediction models via the matrix‐tree theorem”. In: Proc. of EMNLP.

Kuhn, Harold W (1955). “The Hungarian method for the assignment problem”. In: Nav. Res. Log. 2.1‐2, pp. 83–97.

Lacoste‐Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank‐Wolfe optimization variants”. In: Proc.of NeurIPS.

Liu, Yang and Mirella Lapata (2018). “Learning structured text representations”. In: TACL 6, pp. 63–75.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional networks for semantic segmentation”. In: Proc. of CVPR.

Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.

Martins, André FT, Mário AT Figueiredo, et al. (2015). “AD3: Alternating directions dual decomposition for MAP inference in graphical models”. In: JMLR 16.1, pp. 495–545.

McDonald, Ryan T and Giorgio Satta (2007). “On the complexity of non-projective data-driven dependency parsing”. In: Proc. of IWPT.

Nangia, Nikita and Samuel Bowman (2018). “ListOps: A diagnostic dataset for latent tree learning”. In: Proc. of NAACL SRW.


Page 199:

References IV

Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.

Niculae, Vlad and André FT Martins (2020). “LP-SparseMAP: Differentiable relaxed optimization for sparse structured prediction”. In: preprint arXiv:2001.04437.

Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). “SparseMAP: Differentiable sparse structured inference”. In: Proc. of ICML.

Niculae, Vlad, André FT Martins, and Claire Cardie (2018). “Towards dynamic computation graphs via sparse latent structure”. In: Proc. of EMNLP.

Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.

Parikh, Ankur et al. (2016). “A decomposable attention model for natural language inference”. In: Proc. of EMNLP.

Peters, Ben, Vlad Niculae, and André FT Martins (2019). “Sparse sequence-to-sequence models”. In: Proc. of ACL.

Rabiner, Lawrence R. (1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. In: P. IEEE 77.2, pp. 257–286.

Smith, David A and Noah A Smith (2007). “Probabilistic models of nonprojective dependency trees”. In: Proc. of EMNLP.


Page 200:

References V

Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree-structured Long Short-Term Memory networks”. In: Proc. of ACL-IJCNLP.

Taskar, Ben (2004). “Learning structured prediction models: A large margin approach”. PhD thesis. Stanford University.

Tibshirani, Robert et al. (2005). “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.

Tsallis, Constantino (1988). “Possible generalization of Boltzmann-Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.

Valiant, Leslie G (1979). “The complexity of computing the permanent”. In: Theor. Comput. Sci. 8.2, pp. 189–201.

Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.

Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). “A broad‐coverage challenge corpus for sentence understandingthrough inference”. In: preprint arXiv:1704.05426.

Wolfe, Philip (1976). “Finding the nearest point in a polytope”. In: Mathematical Programming 11.1, pp. 128–149.
