Variational Autoencoders and extensions (Draft)

Suthee Chaidaroon
[email protected]

July 2016


Contents

0.1 Change Log

1 Variational Autoencoder
  1.1 Introduction
  1.2 Problem Description
  1.3 Goal
    1.3.1 Intractability
    1.3.2 A large dataset
  1.4 Proposed Solution
  1.5 Setup
    1.5.1 Notation
    1.5.2 Variational Bound
  1.6 Connection with Autoencoder
  1.7 Parameter Estimation
    1.7.1 Estimating by Sampling
    1.7.2 Reparameterization Trick
    1.7.3 Stochastic Gradient Variational Bound (SGVB)
  1.8 Variational Autoencoder
    1.8.1 Encoder
    1.8.2 Decoder
    1.8.3 Regularization term
    1.8.4 Put everything together

2 Semi-supervised Learning
  2.1 Introduction
  2.2 Problems
  2.3 Models
    2.3.1 Latent-Feature Discriminative Model (M1)
    2.3.2 Generative Semi-Supervised Model (M2)
    2.3.3 Stacking Generative Semi-Supervised Model (M1+M2)
  2.4 Lowerbound Objective
    2.4.1 Latent Feature Discriminative Model Objective
    2.4.2 Generative Semi-supervised Model Objective
    2.4.3 Classifier loss
  2.5 Optimization
  2.6 Classification Performance

0.1 Change Log

• July 2016

– Variational Autoencoder

– Semi-Supervised Learning

• Aug 2016

– Add derivation of the regularization term

– Importance Weighted Autoencoders


Chapter 1

Variational Autoencoder

1.1 Introduction

This note summarizes the variational autoencoder described by Kingma D.P. and Welling M. [1]. The work introduces a probabilistic autoencoder that can learn posterior and likelihood distributions from the given data samples. The authors propose a general framework for the variational autoencoder and an efficient inference algorithm to estimate the model parameters. The model shows a connection between feedforward neural networks and probabilistic generative models.

1.2 Problem Description

In a generative model, we often assume that the observed data is generated by an underlying random process. We are given X = {x(i)}, i = 1, ..., N, a set of observed, independently drawn data points from an unknown distribution. We further assume that this distribution is governed by an unobserved continuous random variable z.

The generative model can now be described as:

• z(i) ∼ Pθ∗(z)

• x(i) ∼ Pθ∗(x|z)

Pθ∗(z) and Pθ∗(x|z) come from parametric families of distributions Pθ(z) and Pθ(x|z), whose probability density functions are differentiable w.r.t. θ and z.

However, the true parameters θ∗ and the latent variables z(i) are unknown. Furthermore, for real-world datasets the underlying distribution is often complex. Exponential-family and mean-field factorization assumptions usually constrain the form of the true distribution; without such assumptions we end up with a distribution that has no closed-form solution, which makes parameter estimation difficult.


1.3 Goal

This work aims to solve two general problems:

1.3.1 Intractability

The task of integrating the marginal probability, Pθ(x) = ∫ Pθ(z)Pθ(x|z) dz, is intractable if there is no closed form. The posterior Pθ(z|x) = Pθ(x|z)Pθ(z) / Pθ(x) is also intractable because of the denominator term. As a result, EM and mean-field VB cannot be used. Is there a general algorithm that works efficiently?

1.3.2 A large dataset

There can be so much data that even a batch gradient descent algorithm cannot run efficiently. Thus, updating parameters using a smaller number of samples is desirable; a single data point is even better. Monte Carlo EM would unfortunately be very slow. The authors need an algorithm that also works efficiently on large datasets.

1.4 Proposed Solution

This paper proposes the following solutions:

1. Efficiently approximate ML or MAP estimation of θ (i.e., of Pθ(x|z)). The motivation is to approximate the underlying random process and be able to generate artificial data.

2. Efficiently approximate posterior inference of the latent variable z given an observed value x, parametrized by θ (Pθ(z|x)). This will allow us to encode x into different useful representations.

3. Efficiently approximate marginal inference of x (Pθ(x)). It is possible to use Pθ(x) as a prior to perform other inference tasks such as image denoising, inpainting, and super-resolution.

1.5 Setup

The evidence lowerbound will be derived in this section.

1.5.1 Notation

Here is the notation we use:

• Pθ(z|x) - the true posterior distribution.

• qφ(z|x) - an approximate posterior distribution that will be learned from the data.

• Pθ(x|z) - a generative model that we want to learn from the data.

• Pθ(z) - a prior over the latent variable z.

1.5.2 Variational Bound

Since Pθ(z|x) is unknown, we approximate it by finding a distribution qφ(z|x) that is as similar as possible to the true posterior distribution. This approximation is more flexible than the mean-field approximation, since qφ(z|x) is not required to factorize into independent terms.

The KL divergence is defined as DKL(q||p) = ∫ q(x) log [q(x)/p(x)] dx. One way to derive the lowerbound is to find a qφ(z|x) that is as close as possible to Pθ(z|x) by minimizing the KL-divergence DKL(qφ(z|x) || Pθ(z|x)).

Here are the steps for deriving the lowerbound:

DKL(qφ(z|x) || Pθ(z|x)) = ∫ qφ(z|x) log [qφ(z|x) / Pθ(z|x)] dz
= Eqφ(z|x)[ log qφ(z|x) − log Pθ(z|x) ]
= Eqφ(z|x)[ log qφ(z|x) − log Pθ(x|z) − log Pθ(z) + log Pθ(x) ]
= log Pθ(x) + Eqφ(z|x)[ log qφ(z|x) − log Pθ(x|z) − log Pθ(z) ]

We rearrange it so that the left-hand side contains the data log-likelihood:

log Pθ(x) − DKL(qφ(z|x) || Pθ(z|x)) = Eqφ(z|x)[ −log qφ(z|x) + log Pθ(x|z) + log Pθ(z) ]

log Pθ(x) ≥ Eqφ(z|x)[ −log qφ(z|x) + log Pθ(x|z) + log Pθ(z) ]
= Eqφ(z|x)[ log Pθ(x|z) ] − Eqφ(z|x)[ −log Pθ(z) + log qφ(z|x) ]
= Eqφ(z|x)[ log Pθ(x|z) ] − Eqφ(z|x)[ log ( qφ(z|x) / Pθ(z) ) ]
= Eqφ(z|x)[ log Pθ(x|z) ] − DKL(qφ(z|x) || Pθ(z))

In the second line we dropped the KL-divergence term; since it is always non-negative, the right-hand side is less than or equal to the log-likelihood. The final expression is the variational lowerbound. Let's define it properly:

L(θ, φ; x) = −DKL(qφ(z|x) || Pθ(z)) + Eqφ(z|x)[ log Pθ(x|z) ]    (1.1)

Equation 1.1 has two terms. The first term is the KL-divergence between the approximate posterior and the prior over the latent variable; it acts as a regularization term. The second term is an expected log-likelihood; it acts as a maximum-likelihood objective, trying to find parameters that best describe the observed data x.
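To make equation 1.1 concrete, here is a small numerical check in Python (a sketch of my own, not taken from the paper). For the toy model Pθ(z) = N(0, 1) and Pθ(x|z) = N(z, 1), the marginal Pθ(x) = N(0, 2) is known in closed form, and a Monte Carlo estimate of the lowerbound for a deliberately imperfect qφ(z|x) stays below log Pθ(x); the gap is exactly DKL(qφ(z|x)||Pθ(z|x)).

import numpy as np

rng = np.random.default_rng(0)
x = 1.3

# Model: P(z) = N(0, 1), P(x|z) = N(z, 1)  =>  P(x) = N(0, 2)
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

# A deliberately imperfect approximate posterior q(z|x) = N(0.4*x, 0.6)
mu_q, var_q = 0.4 * x, 0.6
z = rng.normal(mu_q, np.sqrt(var_q), size=200_000)

log_lik   = -0.5 * np.log(2 * np.pi) - (x - z)**2 / 2.0                    # log P(x|z)
log_prior = -0.5 * np.log(2 * np.pi) - z**2 / 2.0                          # log P(z)
log_q     = -0.5 * np.log(2 * np.pi * var_q) - (z - mu_q)**2 / (2 * var_q) # log q(z|x)

elbo = np.mean(log_lik + log_prior - log_q)   # MC estimate of L(theta, phi; x)
print(f"log P(x) = {log_px:.4f}, ELBO = {elbo:.4f}")  # ELBO <= log P(x)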


1.6 Connection with Autoencoder

We can interpret the lowerbound in equation 1.1 as the objective function of an autoencoder. First, the KL-divergence is a regularizer: a penalty term that prevents the autoencoder from overfitting the data by keeping the approximate posterior as close to the prior as possible. The second term can be viewed as a negative reconstruction error; the higher the log-likelihood, the smaller the reconstruction error.

From an information-theoretic perspective, we can view qφ(z|x) as a probabilistic encoder that maps x to z, and Pθ(x|z) as a probabilistic decoder that generates x from z.

1.7 Parameter Estimation

1.7.1 Estimating by Sampling

Parameter estimation can be tricky since the second term in equation 1.1 has no closed form. One approach to estimating the expectation is Monte Carlo (MC) estimation: if we sample many z from qφ(z|x), the sample average of log Pθ(x|z) is approximately equal to its expectation. Thus,

Eqφ(z|x)[ log Pθ(x|z) ] ≈ (1/L) ∑_{l=1}^{L} log Pθ(x|z(l))    (1.2)

where z(l) ∼ qφ(z|x).

However, this estimator cannot be differentiated w.r.t. φ because we need to sample z from a distribution that depends on the model parameter φ. Thus, the paper employs the reparameterization trick, which removes this dependency from the sampling.

1.7.2 Reparameterization Trick

Originally z ∼ qφ(z|x); we now assume that there is a transformation that maps an independent sample ε to z. For instance, we want z = gφ(ε, x) where ε ∼ p(ε). We can choose an appropriate p(ε) and function gφ(ε, x) as long as the transformation corresponds to a tractable inverse CDF of qφ(z|x), a location-scale transformation of a standard distribution, or a composition of such tractable functions.

For example, to find the expectation of an arbitrary function f(z), we can do the following:

Eqφ(z|x)[ f(z) ] = Ep(ε)[ f(gφ(ε, x)) ] ≈ (1/L) ∑_{l=1}^{L} f(gφ(ε(l), x))
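As a sanity check (my own sketch, not from the paper), the following compares direct sampling from a Gaussian qφ(z|x) against the reparameterized form z = µ + σ·ε with ε ∼ N(0, 1). Both Monte Carlo estimates of E[f(z)] agree, but only the second expresses z as a deterministic, differentiable function of (µ, σ):

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 1.2        # parameters of a Gaussian q_phi(z|x)
f = lambda z: z**2          # arbitrary test function; E[f(z)] = mu^2 + sigma^2
L = 200_000

z_direct  = rng.normal(mu, sigma, size=L)    # sample z ~ q_phi(z|x) directly
eps       = rng.normal(0.0, 1.0, size=L)     # sample the noise eps ~ p(eps)
z_reparam = mu + sigma * eps                 # z = g_phi(eps, x)

print(f(z_direct).mean(), f(z_reparam).mean(), mu**2 + sigma**2)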


1.7.3 Stochastic Gradient Variational Bound (SGVB)

The authors use stochastic gradient descent for parameter estimation. Recall the lowerbound from equation 1.1; applying the reparameterization trick and MC estimation to it gives a first estimator:

L̃A(θ, φ; x) = (1/L) ∑_{l=1}^{L} [ log Pθ(x, z(l)) − log qφ(z(l)|x) ]    (1.3)

where z(l) = gφ(ε(l), x) and ε(l) ∼ p(ε).

This estimator requires MC estimation of both the likelihood and posterior terms. To reduce the variance of the MC estimate, the KL-divergence term can be integrated analytically, which reduces the estimation to a single term. From equation 1.1, the second estimator of the lowerbound is:

L̃B(θ, φ; x) = −DKL(qφ(z|x) || Pθ(z)) + (1/L) ∑_{l=1}^{L} log Pθ(x|z(l))    (1.4)

For the full dataset, the lowerbound is a sum over all data points. We can use a mini-batch of M data points, which gives the lowerbound estimator:

L(θ, φ; X) ≈ L̃M(θ, φ; XM) = (N/M) ∑_{i=1}^{M} L̃(θ, φ; x(i))

= (N/M) ∑_{i=1}^{M} { −DKL(qφ(z|x(i)) || Pθ(z)) + (1/L) ∑_{l=1}^{L} log Pθ(x(i)|z(l)) }    (1.5)

The following is the pseudocode for the SGVB algorithm:

Algorithm 1 SGVB algorithm

θ, φ ← initialize parameters
while not converged do
    XM ← draw a mini-batch of M data points
    ε ← draw samples from p(ε)
    g ← compute the gradient ∇θ,φ L̃M(θ, φ; XM, ε)
    θ, φ ← update parameters using SGD with gradient g
end while
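A minimal PyTorch-style sketch of Algorithm 1 (my own illustration, not the authors' code), assuming a hypothetical model object whose elbo(x) method draws ε internally and returns the reparameterized lowerbound estimate for the mini-batch:

import torch

def train_sgvb(model, data_loader, num_epochs=10, lr=1e-3):
    # model.elbo(x) is assumed to sample eps inside and return the
    # reparameterized lowerbound estimate for the mini-batch x.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for x in data_loader:          # X_M <- draw M data points
            loss = -model.elbo(x)      # maximize L  <=>  minimize -L
            optimizer.zero_grad()
            loss.backward()            # g <- gradient of the lowerbound
            optimizer.step()           # theta, phi <- SGD update
    return model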

1.8 Variational Autoencoder

This section demonstrates how a neural network is used to approximate the posterior of the generative model.


1.8.1 Encoder

We assume qφ(z|x) is a Gaussian distribution whose mean and variance are estimated from x. We use a deep neural network as a non-linear function that maps x to the mean and variance vectors, as follows:

qφ(z|x) = N(z; µz(x, φ), σ²z(x, φ))    (1.6)

We use a subscript zi to denote the i-th dimension of the latent variable z (so z1 is its first dimension). We want to learn the non-linear functions µzi and σ²zi using a neural network. We define these two functions as follows:

µzi(x, φ) = W(l)µzi h(l) + b(l)µzi    (1.7)

σzi(x, φ) = W(l)σzi h(l) + b(l)σzi    (1.8)

where each hidden unit is calculated by:

h(l)i = tanh( W(l−1)i h(l−1) + b(l−1)i )    (1.9)

Finally, we set h(1)i = xi.
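A sketch of the encoder in PyTorch (the layer sizes and the choice to output log σ² rather than σ² are my own assumptions; the tanh hidden layer follows eq. 1.9):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): a tanh hidden layer followed by linear heads for the
    mean and log-variance of a diagonal Gaussian over z (eqs. 1.6-1.9)."""
    def __init__(self, x_dim, h_dim, z_dim):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)  # predict log(sigma^2) for numerical stability

    def forward(self, x):
        h = torch.tanh(self.hidden(x))         # h = tanh(W x + b)
        return self.mu(h), self.logvar(h)      # mu_z(x, phi), log sigma_z^2(x, phi)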

1.8.2 Decoder

For the decoder, we assume the generative distribution is also Gaussian, similar to the encoder. The derivation mirrors that of the encoder, but we need to add the reparameterization as well.

Pθ(x|z) = N(x; µx(z, θ), σ²x(z, θ))    (1.10)

Both mean and variance functions are defined as:

µxi(z, θ) = W(l′)µxi h(l′) + b(l′)µxi    (1.11)

σxi(z, θ) = W(l′)σxi h(l′) + b(l′)σxi    (1.12)

The hidden units are defined as:

h(l′)j = tanh( W(l′−1)j h(l′−1) + b(l′−1)j )    (1.13)

But now the first hidden layer is set as h(1)j = zj. Then we reparameterize z so that z = µz(x, φ) + σz(x, φ) ⊙ ε, where ⊙ denotes element-wise multiplication and ε ∼ N(0, I).
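A matching PyTorch sketch of the Gaussian decoder and of the reparameterization z = µz(x, φ) + σz(x, φ) ⊙ ε (an illustration under the same assumptions as the encoder sketch above):

# continues the encoder sketch (torch, nn already imported)

class Decoder(nn.Module):
    """p_theta(x|z): mirrors the encoder, producing the mean and
    log-variance of a Gaussian over x (eqs. 1.10-1.13)."""
    def __init__(self, z_dim, h_dim, x_dim):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.mu = nn.Linear(h_dim, x_dim)
        self.logvar = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps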


1.8.3 Regularization term

The KL-divergence term in eq. 1.5 has an analytical solution since both the prior and the approximate posterior are Gaussian. Following the derivation in [1], the term −DKL(qφ(z|x) || Pθ(z)) is equivalent to

(1/2) ∑_{j=1}^{J} ( 1 + log((σj)²) − (µj)² − (σj)² )

where j indexes the dimensions of z.

Here is a derivation for the simple case of a 1-D Gaussian. We compute the term ∫ q(z) log q(z) dz (the remaining term, ∫ q(z) log Pθ(z) dz, is computed in the same way):

∫ q(z) log q(z) dz = ∫ N(z; µ, σ²) log N(z; µ, σ²) dz
= Ez[ log N(z; µ, σ²) ]
= Ez[ log ( (1/√(2πσ²)) exp{ −(z − µ)² / (2σ²) } ) ]
= −(1/2) log 2π − (1/2) log σ² − (1/(2σ²)) Ez[(z − µ)²]
= −(1/2) log 2π − (1/2) log σ² − 1/2
= −(1/2) log 2π − (1/2)( log σ² + 1 )    (1.14)

We can generalize this result to a multivariate Gaussian by using the fact that the covariance is a diagonal matrix, so each dimension is independent. The log-likelihood becomes a sum of per-dimension log-likelihoods, each parameterized by µj and σj.
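In code, this closed-form term is one line (continuing the PyTorch sketch above; logvar denotes log σ², a parameterization choice that is my own assumption):

def neg_kl_standard_normal(mu, logvar):
    # -D_KL( N(mu, diag(sigma^2)) || N(0, I) )
    #   = 0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    return 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)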

1.8.4 Put everything together

The lowerbound estimator from eq. 1.5 then has the following form:

L(θ, φ; X) ≈ L̃M(θ, φ; XM)

= (N/M) ∑_{i=1}^{M} { −DKL(qφ(z|x(i)) || Pθ(z)) + (1/L) ∑_{l=1}^{L} log Pθ(x(i)|z(l)) }

= (N/M) ∑_{i=1}^{M} { (1/2) ∑_{j=1}^{J} ( 1 + log((σj)²) − (µj)² − (σj)² ) + (1/L) ∑_{l=1}^{L} log Pθ(x(i)|z(l)) }

We use backpropagation to compute the gradient of the lowerbound and SGD to estimate the parameters θ and φ.
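Putting the sketches together, a single-sample (L = 1) estimate of the lowerbound for one mini-batch might look like the following (my own assembly of the pieces above, with a Gaussian likelihood; the N/M scaling from eq. 1.5 is left to the caller):

import math

def elbo(x, encoder, decoder):
    # One-sample estimate of the lowerbound, averaged over the mini-batch.
    mu_z, logvar_z = encoder(x)
    z = reparameterize(mu_z, logvar_z)
    mu_x, logvar_x = decoder(z)
    # Gaussian log-likelihood log p_theta(x|z), summed over data dimensions
    log_lik = -0.5 * torch.sum(
        logvar_x + (x - mu_x).pow(2) / logvar_x.exp() + math.log(2 * math.pi),
        dim=-1)
    return (log_lik + neg_kl_standard_normal(mu_z, logvar_z)).mean()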


Figure 1.1: A 2-D manifold learned from the MNIST dataset.

Figure 1.2: A generated digit after the model has learned the MNIST dataset.


Chapter 2

Semi-supervised Learning

2.1 Introduction

We investigate the first natural extension of the variational autoencoder: adding a supervised component to the generative model. Here I summarize the paper "Semi-supervised Learning with Deep Generative Models" by Kingma et al. [2]. The paper demonstrates that additional label data can produce a better representation, which can be used for classification tasks.

2.2 Problems

In the classification problem, we seek a decision boundary that accurately separates one group of data from the others. In a discriminative model, we use pairs of data and labels to construct this boundary. However, many datasets suffer from the fact that as the data grows, labels become more expensive to obtain. The semi-supervised learning approach is then used to exploit both labelled and unlabelled information in order to improve classification performance.

2.3 Models

This paper introduces a neural variational framework for the semi-supervised learning task. In general, there are two models. The first model is taken directly from the original work by Kingma and Welling: it finds a mapping from the input data into a latent space. The second model is their contribution: a generative semi-supervised model, which assumes that the data is generated by both a latent variable and a label.

2.3.1 Latent-Feature Discriminative Model (M1)

We summarize the generative process here:


• z ∼ N (z|0, I)

• x ∼ p(x|z) = f(x; z, θ)

The variable x is drawn from a conditional distribution whose parameters are produced by a non-linear function f implemented as a neural network. This model is taken directly from the original variational autoencoder paper.

2.3.2 Generative Semi-Supervised Model (M2)

This model incorporates labels to find a conditional mapping from the data to the latent space. The generative process is described below:

• y ∼ Cat(y|π)

• z ∼ N (z|0, I)

• x ∼ Pθ(x|y, z) = f(x; y, z, θ)

First, the label y is drawn from a categorical (multinomial) distribution parameterized by π. Then we draw a latent representation z from a spherical Gaussian distribution. Finally, x is drawn from a likelihood function whose parameters are produced by the non-linear function f.
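To make this generative process concrete, here is a small ancestral-sampling sketch in Python (the linear map standing in for f(x; y, z, θ), and all dimensions, are purely illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
K, z_dim, x_dim = 10, 2, 5                 # number of classes, latent and data dims

pi = np.full(K, 1.0 / K)                   # class prior
y = rng.choice(K, p=pi)                    # y ~ Cat(y | pi)
z = rng.normal(size=z_dim)                 # z ~ N(0, I)

# Stand-in for the non-linear decoder f(x; y, z, theta):
# a single linear layer applied to [one-hot(y), z].
W = rng.normal(size=(x_dim, K + z_dim))
b = rng.normal(size=x_dim)
x_mean = W @ np.concatenate([np.eye(K)[y], z])
x = rng.normal(loc=x_mean, scale=1.0)      # x ~ P_theta(x | y, z)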

2.3.3 Stacking Generative Semi-Supervised Model (M1+M2)

They stack both models by first learning the latent variable z1 with M1, then learning M2 using z1 instead of x. This model is an extension of M2.

2.4 Lowerbound Objective

In this section we will derive the lowerbound for both models: M1 and M2.

2.4.1 Latent Feature Discriminative Model Objective

Following the conventional derivation, the lowerbound is the difference between an expected log-likelihood and the KL-divergence between the approximate posterior and the prior:

log pθ(x) ≥ Eqφ(z|x)[ log pθ(x|z) ] − DKL(qφ(z|x) || pθ(z)) = −J(x)    (2.1)

This model does not consider the label and simply attempts to model the relationship between the data and its corresponding latent structure.


2.4.2 Generative Semi-supervised Model Objective

When we consider the label, there are two cases, depending on whether the label corresponding to a data point is observed or unobserved. For the observed case, the joint distribution has only z as a latent variable.

log pθ(x, y) ≥ Eqφ(z|x,y)[ log pθ(x, y, z) − log qφ(z|x, y) ]
= Eqφ(z|x,y)[ log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(z|x, y) ]    (2.2)
= −L(x, y)

According to the generative process, y and z are conditionally independent given x. For an unobserved label, we treat y as a latent variable. The approximate posterior qφ(y, z|x) has a factorised form, qφ(y, z|x) = qφ(z|x, y) qφ(y|x):

log pθ(x) ≥ Eqφ(y,z|x)[ log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(y, z|x) ]
= Eqφ(y,z|x)[ log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(z|x, y) − log qφ(y|x) ]
= Eqφ(y|x)[ Eqφ(z|x,y)[ log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(z|x, y) − log qφ(y|x) ] ]
= Eqφ(y|x)[ Eqφ(z|x,y)[ log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(z|x, y) ] − log qφ(y|x) ]
= Eqφ(y|x)[ −L(x, y) ] − Eqφ(y|x)[ log qφ(y|x) ]
= Eqφ(y|x)[ −L(x, y) ] + H(qφ(y|x))
= ∑_y qφ(y|x){ −L(x, y) } + H(qφ(y|x))    (2.3)
= −U(x)

This result marginalizes over y and computes the average of the lowerbound over the labels instead. Marginalizing works here because the number of labels is not large; the paper uses the MNIST dataset, which has 10 labels.
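A sketch of this marginalization in code (continuing the PyTorch sketches; I assume y_logits are the unnormalized outputs of the classifier qφ(y|x) and neg_L is a tensor whose column y holds −L(x, y) evaluated with that label clamped):

def neg_U(y_logits, neg_L):
    # -U(x) = sum_y q_phi(y|x) * (-L(x, y)) + H(q_phi(y|x))   (eq. 2.3)
    q_y = torch.softmax(y_logits, dim=-1)                 # q_phi(y|x), shape (batch, K)
    entropy = -(q_y * torch.log(q_y + 1e-8)).sum(dim=-1)  # H(q_phi(y|x))
    return (q_y * neg_L).sum(dim=-1) + entropy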

The lowerbound is then a combination of two terms:

J = ∑_{(x,y)∼p̃l} L(x, y) + ∑_{x∼p̃u} U(x)    (2.4)

2.4.3 Classifier loss

The discriminative model qφ(y|x) only contributes to U(x), which does not make for a good classifier because qφ(y|x) should also learn from the labelled data. Thus, we add a classifier loss to equation 2.4:

Jα = J + α · Ep̃l(x,y)[ −log qφ(y|x) ]    (2.5)

The parameter α controls the relative weight between the generative model and purely discriminative learning.
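A sketch of the combined objective in equation 2.5 (my own assembly; L_labelled and U_unlabelled are assumed to hold the per-example bounds from the previous sections, and the cross-entropy term estimates Ep̃l(x,y)[−log qφ(y|x)] on the labelled batch):

import torch.nn.functional as F

def semi_supervised_loss(L_labelled, U_unlabelled, y_logits_labelled, y_true, alpha):
    # J^alpha = sum L(x, y) + sum U(x) + alpha * E_pl[-log q_phi(y|x)]   (eqs. 2.4-2.5)
    classifier_loss = F.cross_entropy(y_logits_labelled, y_true)  # mean of -log q_phi(y|x)
    return L_labelled.sum() + U_unlabelled.sum() + alpha * classifier_loss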


2.5 Optimization

Model M1 is trained by first learning qφ(z|x) using the objective function from eq. 2.1 and then training a classifier on the new representation. Training M2 is similar, but it learns qφ(y|x) and qφ(z|x, y) jointly. Finally, the stacking model (M1+M2) is trained by first training M1, then using qφ(z|x) to map x into the latent space, and using this new representation as the input data for training model M2. For more implementation details, please refer to [2].

2.6 Classification Performance

The stacking model (M1+M2) is very effective. This shows that using M1's latent representation as the input to the generative semi-supervised model M2 significantly boosts the discriminative quality of the representation in the latent space.


Bibliography

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[2] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581-3589, 2014.
