
Deep Generative Models for NLP

Miguel Rios

April 18, 2019

Content

Generative models
Exact Marginal
Intractable marginalisation
DGM4NLP

Why generative models?

Deep Learning
- pro: Rich non-linear models for classification and sequence prediction.
- pro: Scalable learning using stochastic approximation; conceptually simple.
- con: Only point estimates.
- con: Hard to score models, perform model selection, and penalise complexity.

Probabilistic modelling
- pro: Unified framework for model building, inference, prediction and decision making.
- pro: Explicit accounting for uncertainty and variability of predictions.
- pro: Robust to over-fitting.
- pro: Offers tools for model selection and composition.
- con: Potentially intractable inference.
- con: Computationally expensive.
- con: Long simulation times.


Why generative models?
- Lack of training data.
- Partial supervision.
- Lack of inductive bias.


PGM
- Inference in graphical models is the problem of computing a conditional probability distribution over the values of some of the nodes.
- We also want to compute marginal probabilities in graphical models, in particular the probability of the observed evidence.
- A latent variable model is a probabilistic model over observed and latent random variables.
- For a latent variable we do not have any observations.


Content

Generative models
Exact Marginal
Intractable marginalisation
DGM4NLP

IBM 1-2

Latent alignment
- Count-based models with EM attempt to find the maximum-likelihood estimates for the data.
- Feature-rich models (neural networks to combine features).
- Bayesian parametrisation of IBM.

IBM1: incomplete-data likelihood

(Graphical model: alignment a and French word f for each of the m French positions, conditioned on the English sentence e_0^l and the lengths l and m; plate over S sentence pairs.)

Incomplete-data likelihood

p(f_1^m | e_0^l) = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} p(f_1^m, a_1^m | e_0^l)   (1)
                 = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p(a_j | l, m) p(f_j | e_{a_j})   (2)
                 = \prod_{j=1}^{m} \sum_{a_j=0}^{l} p(a_j | l, m) p(f_j | e_{a_j})   (3)
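To make equation (3) concrete, here is a minimal Python sketch (not from the slides) that computes the per-sentence log marginal likelihood of IBM model 1, assuming the uniform alignment prior p(a_j | l, m) = 1/(l + 1) and a lexical table stored as a dict of dicts; all names (`t`, `ibm1_log_marginal`) are illustrative.

```python
import math

def ibm1_log_marginal(f_sent, e_sent, t):
    """log p(f_1^m | e_0^l) as in eq. (3), with a uniform alignment prior.

    f_sent: list of m French tokens
    e_sent: list of l+1 English tokens, position 0 being the NULL word
    t:      dict of dicts, t[e][f] = p(f | e)  (lexical translation probabilities)
    """
    l = len(e_sent) - 1                       # English length excluding NULL
    log_p = 0.0
    for f_j in f_sent:
        # sum_{a_j=0}^{l} p(a_j | l, m) p(f_j | e_{a_j}) with p(a_j | l, m) = 1 / (l + 1)
        inner = sum(t.get(e_i, {}).get(f_j, 1e-12) for e_i in e_sent) / (l + 1)
        log_p += math.log(inner)
    return log_p

# Toy usage with made-up probabilities:
t = {"NULL": {"la": 0.1, "casa": 0.1},
     "the": {"la": 0.7, "casa": 0.1},
     "house": {"la": 0.1, "casa": 0.8}}
print(ibm1_log_marginal(["la", "casa"], ["NULL", "the", "house"], t))
```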

IBM1: posterior

Posterior

p(a_1^m | f_1^m, e_0^l) = \frac{p(f_1^m, a_1^m | e_0^l)}{p(f_1^m | e_0^l)}   (4)

Factorised

p(a_j | f_1^m, e_0^l) = \frac{p(a_j | l, m) p(f_j | e_{a_j})}{\sum_{i=0}^{l} p(i | l, m) p(f_j | e_i)}   (5)
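Equation (5) is just a normalisation over English positions. A small sketch under the same assumptions as above (uniform alignment prior, illustrative dict-of-dicts table `t`):

```python
def ibm1_posterior(f_j, e_sent, t):
    """p(a_j = i | f_1^m, e_0^l) for i = 0..l, eq. (5), with a uniform p(i | l, m)."""
    # With a uniform alignment prior the prior factor cancels in the ratio,
    # so the posterior is just the normalised lexical scores.
    scores = [t.get(e_i, {}).get(f_j, 1e-12) for e_i in e_sent]
    z = sum(scores)
    return [s / z for s in scores]

# p(a_j = i | f_j = "casa") over positions ["NULL", "the", "house"]
print(ibm1_posterior("casa", ["NULL", "the", "house"], t))
```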

MLE via EM

E-step:

E[n(e → f | A_1^m)] = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} p(a_1^m | f_1^m, e_0^l) n(e → f | a_1^m)   (6)
                    = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} p(a_1^m | f_1^m, e_0^l) \sum_{j=1}^{m} 1_e(e_{a_j}) 1_f(f_j)   (7)
                    = \sum_{j=1}^{m} \sum_{i=0}^{l} p(a_j = i | f_1^m, e_0^l) 1_e(e_i) 1_f(f_j)   (8)

M-step:

θ_{e,f} = \frac{E[n(e → f | a_1^m)]}{\sum_{f'} E[n(e → f' | a_1^m)]}   (9)
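Putting (6)-(9) together, one EM iteration for IBM model 1 can be sketched as follows. This is an illustrative implementation, not code from the lecture; it reuses the `ibm1_posterior` helper above and assumes the uniform alignment prior.

```python
from collections import defaultdict

def ibm1_em_step(corpus, t):
    """One EM iteration. corpus is a list of (f_sent, e_sent) pairs; t is the lexical table."""
    counts = defaultdict(lambda: defaultdict(float))   # E-step: expected counts n(e -> f)
    for f_sent, e_sent in corpus:
        for f_j in f_sent:
            post = ibm1_posterior(f_j, e_sent, t)      # p(a_j = i | f, e) for each position i
            for p_i, e_i in zip(post, e_sent):
                counts[e_i][f_j] += p_i                # accumulate eq. (8)
    new_t = {}
    for e, row in counts.items():                      # M-step: renormalise per English word, eq. (9)
        total = sum(row.values())
        new_t[e] = {f: c / total for f, c in row.items()}
    return new_t
```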

IBM 1-2: strong assumptions

Independence assumptions
- p(a | l, m) does not depend on lexical choices:
  a1 cute2 house3 ↔ una1 casa3 bella2
  a1 cosy2 house3 ↔ una1 casa3 confortable2
- p(f | e) can only reasonably explain one-to-one alignments:
  I will be leaving soon ↔ voy a salir pronto

Parameterisation
- Categorical events are unrelated:
  prefixes/suffixes: normal, normally, abnormally, ...
  verb inflections: comer, comí, comía, comió, ...
  gender/number: gato, gatos, gata, gatas, ...


Berg-Kirkpatrick et al. [2010]

Lexical distribution in IBM model 1

p(f | e) = \frac{\exp(w_{lex}^⊤ h_{lex}(e, f))}{\sum_{f'} \exp(w_{lex}^⊤ h_{lex}(e, f'))}   (10)

Features
- f ∈ V_F is a French word (decision), e ∈ V_E is an English word (conditioning context), w ∈ R^d is the parameter vector, and h : V_F × V_E → R^d is a feature vector function.
- prefixes/suffixes
- character n-grams
- POS tags
- Learning uses combinations of these features, e.g. via neural networks.
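A minimal numpy sketch of equation (10): p(f | e) is a softmax over feature scores w^⊤ h(e, f). The single prefix-match feature below is purely illustrative; Berg-Kirkpatrick et al. use much richer feature sets.

```python
import numpy as np

def h_lex(e, f):
    """Toy feature vector for the pair (e, f): [bias, shared 3-letter prefix?]."""
    return np.array([1.0, float(e[:3] == f[:3])])

def p_f_given_e(e, french_vocab, w):
    """p(f | e) = softmax over f' of w^T h(e, f'), eq. (10)."""
    scores = np.array([w @ h_lex(e, f) for f in french_vocab])
    scores -= scores.max()                 # numerical stability
    exp_scores = np.exp(scores)
    return dict(zip(french_vocab, exp_scores / exp_scores.sum()))

w = np.array([0.0, 2.0])                   # illustrative weights
print(p_f_given_e("normal", ["normalmente", "gato", "casa"], w))
```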

Neural IBM
- f_θ(e) = softmax(W_t h_E(e) + b_t); the softmax is necessary to make f_θ produce valid parameters for the categorical distribution, with W_t ∈ R^{|V_F| × d_h} and b_t ∈ R^{|V_F|}.
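A possible PyTorch sketch of this parametrisation (the module name and sizes are illustrative): the English word is embedded, projected by a linear layer, and passed through a softmax so the output is a valid categorical distribution over V_F.

```python
import torch
import torch.nn as nn

class NeuralIBM1Lexical(nn.Module):
    """f_theta(e): categorical distribution over V_F given an English word."""
    def __init__(self, v_e, v_f, d_h):
        super().__init__()
        self.embed = nn.Embedding(v_e, d_h)       # h_E(e)
        self.proj = nn.Linear(d_h, v_f)           # W_t h_E(e) + b_t

    def forward(self, e_ids):
        # softmax turns the scores into valid parameters of a categorical distribution
        return torch.softmax(self.proj(self.embed(e_ids)), dim=-1)

model = NeuralIBM1Lexical(v_e=1000, v_f=1200, d_h=64)
probs = model(torch.tensor([3, 17]))              # shape (2, |V_F|), rows sum to 1
```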

MLE
- We still need to be able to express the functional form of the likelihood.
- Let us then express the log-likelihood (the objective we maximise in MLE) of a single sentence pair as a function of our free parameters:

  L(θ | e_0^l, f_1^m) = log p_θ(f_1^m | e_0^l)   (11)

- p(f | e) = \prod_j p(f_j | e) = \prod_j \sum_{a_j} p(a_j | m, l) p(f_j | e_{a_j})
- Note that our log-likelihood is in fact a sum of independent terms L_j(θ | e_0^l, f_j), so we can characterise the contribution of each French word in each sentence pair.
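Combining the neural lexical model with the exact marginalisation over alignments gives a differentiable training objective. A hedged sketch, assuming the `NeuralIBM1Lexical` module above and the uniform alignment prior (padding and batching are ignored):

```python
def neg_log_likelihood(model, e_ids, f_ids):
    """-log p(f_1^m | e_0^l) for one sentence pair, eq. (11) with the sum of eq. (3).

    e_ids: LongTensor of l+1 English ids (position 0 = NULL)
    f_ids: LongTensor of m French ids
    """
    t = model(e_ids)                       # (l+1, |V_F|) lexical distributions
    # p(f_j | e) = mean over English positions of t[i, f_j]  (uniform alignment prior)
    per_word = t[:, f_ids].mean(dim=0)     # shape (m,)
    return -torch.log(per_word).sum()

loss = neg_log_likelihood(model, torch.tensor([0, 3, 17]), torch.tensor([5, 42]))
loss.backward()                            # gradients flow through the exact marginal
```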


Content

Generative models
Exact Marginal
Intractable marginalisation
DGM4NLP

Variational Inference
- We assume that x = x_1^n are observations and z = z_1^n are hidden continuous variables. We assume additional parameters θ that are fixed.
- We are interested in performing MLE learning of the parameters θ.
- This requires marginalisation over the unobserved latent variables z.
- However, this integration is intractable:

  p_θ(x) = \int p_θ(x | z) p_θ(z) dz   (12)

- We are also interested in posterior inference for the latent variable:

  p(z | x) = \frac{p(x, z)}{p(x)}   (13)
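To see why (12) is a problem, note that the generic fallback is to approximate the integral by sampling, e.g. a naive Monte Carlo estimate p(x) ≈ (1/S) Σ_s p(x | z_s) with z_s ~ p(z). The sketch below assumes a standard normal prior and a unit-variance Gaussian likelihood (my choices, not the slides'); such estimators have very high variance, which is part of what motivates variational inference.

```python
import torch

def naive_mc_log_marginal(x, decoder, n_samples=1000):
    """log p(x) ~= log (1/S) sum_s p(x | z_s), with z_s ~ N(0, I). Illustrative only."""
    z = torch.randn(n_samples, 2)                         # S samples from the prior p(z)
    mu = decoder(z)                                       # assumed likelihood p(x | z) = N(x; mu(z), I)
    d = x.shape[-1]
    log_px_given_z = -0.5 * ((x - mu) ** 2).sum(dim=-1) \
                     - 0.5 * d * torch.log(torch.tensor(2 * torch.pi))
    # log-mean-exp for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(n_samples)))

decoder = torch.nn.Linear(2, 5)                           # stand-in generative network
print(naive_mc_log_marginal(torch.randn(5), decoder))
```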


Variational Inference
- [Jordan et al., 1999] introduce a variational approximation q(z | x) to the true posterior.
- The objective is to pick a family of distributions over the latent variables, with its own variational parameters, q_φ(z).
- Then we find the parameters that make q close to the true posterior.
- We use q with the fitted variational parameters as a proxy for the true posterior, e.g. to form predictions about future data or to investigate the posterior distribution of the latent variables.


Variational Inference

We optimise the variational parameters φ (starting from an initial value φ_init) in order to minimise the KL divergence, so that q_φ(z) gets closer to the true posterior.

KL divergence
- We measure the closeness of two distributions with the Kullback-Leibler (KL) divergence.
- We focus on KL-based variational inference [Blei et al., 2016], where the KL divergence between q(z) and p(z | x) is optimised:

  KL(q || p) = E_q[ log \frac{q(z)}{p(z | x)} ]   (14)

- We cannot minimise the KL divergence exactly, but we can maximise a lower bound on the marginal likelihood.
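As a concrete instance of (14): when both distributions are diagonal Gaussians (the common case later, when q_φ and the prior are Gaussian) the KL divergence has a closed form. The Gaussian assumption here is mine, for illustration only.

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

# KL between q = N(0.5, 1) and the standard normal prior p = N(0, 1), one dimension
print(kl_diag_gaussians(torch.tensor([0.5]), torch.tensor([0.0]),
                        torch.tensor([0.0]), torch.tensor([0.0])))
```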


Evidence lower bound
- Jensen's inequality applied to probability distributions: when f is concave, f(E[X]) ≥ E[f(X)].
- We use Jensen's inequality on the log probability of the observations. This gives the evidence lower bound (ELBO):

  log p_θ(x) = log \int p_θ(x | z) p_θ(z) dz
             = log \int \frac{q_φ(z)}{q_φ(z)} p_θ(x | z) p_θ(z) dz
             = log E_q[ \frac{p_θ(x | z) p_θ(z)}{q_φ(z)} ]
             ≥ E_q[ log \frac{p_θ(x | z) p_θ(z)}{q_φ(z)} ]
             = E_q[ log \frac{p_θ(z)}{q_φ(z)} ] + E_q[ log p_θ(x | z) ]
             = -KL( q_φ(z) || p_θ(z) ) + E_q[ log p_θ(x | z) ]
             = L(θ, φ | x)   (15)
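The last line of (15) is the quantity optimised in practice. As an illustration (not from the slides), here is a single-sample estimate of the ELBO for a toy model with a standard normal prior, a Gaussian q_φ(z | x), and a Bernoulli likelihood, using the reparameterisation trick so that gradients reach φ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(10, 2 * 2)     # outputs [mu, logvar] for a 2-dimensional z
decoder = nn.Linear(2, 10)         # outputs logits for a 10-dimensional binary x

def elbo(x):
    """Single-sample estimate of E_q[log p(x|z)] - KL(q_phi(z|x) || p(z)), with p(z) = N(0, I)."""
    mu, logvar = encoder(x).chunk(2, dim=-1)                 # parameters of q_phi(z | x)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)     # reparameterised sample z ~ q
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="sum")                      # E_q[log p(x | z)], one sample
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()  # KL(q || N(0, I)), closed form
    return log_px_given_z - kl

x = torch.bernoulli(torch.full((10,), 0.3))  # a toy binary observation
(-elbo(x)).backward()                        # maximise the ELBO by minimising its negative
```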


ELBO
- The objective is to optimise the function q_φ(z) to maximise the ELBO:

  KL( q_φ(z) || p_θ(z | x) ) = E_q[ log \frac{q_φ(z)}{p_θ(z | x)} ]
                             = E_q[ log q_φ(z) - log p_θ(z | x) ]
                             = E_q[ log q_φ(z) - log \frac{p_θ(x | z) p_θ(z)}{p_θ(x)} ]
                             = E_q[ log \frac{q_φ(z)}{p_θ(z)} ] - E_q[ log p_θ(x | z) ] + log p_θ(x)
                             = -L(θ, φ | x) + log p_θ(x)   (16)

Evidence lower bound
- This gives a lower bound on the log marginal likelihood:

  log p_θ(x) ≥ log p_θ(x) - KL( q_φ(z | x) || p_θ(z | x) )
             = E_q[ log p_θ(x | z) ] - KL( q_φ(z | x) || p(z) )   (17)

- It lower-bounds the marginal distribution of x.


Mean Field
- We assume that the variational family factorises:

  q(z_0, ..., z_N) = \prod_{i} q(z_i)   (18)

- This simplification makes optimisation and inference with VI tractable.
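Under (18) the joint variational log-density decomposes into a sum over factors, which is what keeps the ELBO's entropy term cheap to evaluate. A tiny sketch, assuming (my assumption) that each factor q(z_i) is a univariate Gaussian:

```python
import torch

mu = torch.zeros(4)                 # variational means, one per factor q(z_i)
logvar = torch.zeros(4)             # variational log-variances, one per factor

def log_q(z):
    """log q(z_1, ..., z_N) = sum_i log q(z_i) for independent Gaussian factors."""
    log_two_pi = torch.log(torch.tensor(2 * torch.pi))
    return (-0.5 * ((z - mu) ** 2 / logvar.exp() + logvar + log_two_pi)).sum()

print(log_q(torch.randn(4)))        # joint log-density is just a sum over factors
```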


Content

Generative models
Exact Marginal
Intractable marginalisation
DGM4NLP

Document modelling
- Know what topics are being discussed on Twitter, and with what distribution they occur.

Word representation

Word representation

Generative model
- Embed words as probability densities.
- Add extra information about the context, e.g. translations as a proxy to sense annotation.

Natural Language Inference

Classification

Natural Language Inference

Generative model
- Avoid over-fitting.
- Change of prior.


Natural Language Inference

Confidence of classification
- Bayesian NN.
- We place a prior distribution over the model parameters, p(θ).


Neural Machine Translation

References

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. Painless unsupervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 582-590, Los Angeles, California, June 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N10-1083.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. ArXiv e-prints, January 2016.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999. ISSN 0885-6125. doi: 10.1023/A:1007665907178. URL http://dx.doi.org/10.1023/A%3A1007665907178.