From GAN to Wasserstein GAN
Peng Xu
Columbia University in the City of New York
April 12, 2020
P. Xu (Columbia) GAN2WGAN April 12, 2020 1 / 40
Introduction
Given a dataset {(y1, x1), . . . , (yn, xn)} (or simply {x1, . . . , xn}), it would be very useful to learn the data-generating distributions, i.e. P(X | Y), P(X, Y), and P(X). We can use them to make classifications and to generate new data instances.
Models that go directly after the distribution of the data are called generative. Some of the classical generative learning models include: Naïve Bayes, LDA, QDA, etc.
The Generative Adversarial Network (GAN, [Goo+14]) is a generative model that utilizes neural networks to learn the data distribution and generate new samples. Due to its promising performance, GAN has become extremely popular in the field of machine learning since its introduction.
Introduction
The idea of GAN is to train not one, but two models simultaneously: a discriminator D and a generator G, and to let them participate in an adversarial game.
The generator G is trained to generate samples to fool the discriminator, and the discriminator D is trained to distinguish between real data and fake samples generated by G.
Through the “competition” between these two players, we hope that their performances will improve and eventually become satisfactory.
In the language of game theory, we train D and G until they reach a Nash equilibrium.
Introduction
To be more specific, the generator $G_\theta(z)$ is a differentiable function with parameters $\theta$ that takes random noises $z$ as input and produces a sample in the data space $X$ as output.
The noises $z$ are the latent variables associated with the data. This setup is analogous to inverse transform sampling: if we know the CDF $F_X$ for some $X$, then we can sample $F_X^{-1}(U) \sim X$ with $U \sim U[0, 1]$.
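The inverse-transform analogy is easy to check numerically; a minimal sketch using the exponential distribution, whose inverse CDF is known in closed form (the distribution choice here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse CDF of Exp(1): F^{-1}(u) = -ln(1 - u)
u = rng.uniform(size=100_000)   # U ~ U[0, 1]
x = -np.log(1.0 - u)            # F_X^{-1}(U) ~ Exp(1)

# The transformed uniforms behave like Exp(1) draws:
print(x.mean())   # ≈ 1 (mean of Exp(1))
print(x.var())    # ≈ 1 (variance of Exp(1))
```

The generator plays the same role as $F_X^{-1}$, except that the map from noise to data is learned rather than derived from a known CDF.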
The discriminator $D_\delta(x)$, a function with parameters $\delta$, takes a sample $x$ and returns the score (probability) that $x$ came from the real data rather than $G$.
Both D and G in the original GAN are multilayer perceptrons. A more popular architecture, the deep convolutional generative adversarial network (DCGAN, [RMC15]), replaced them with deep CNNs.
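As a concrete sketch of the two players, here is a tiny numpy version of an MLP generator and discriminator; the layer sizes and activations are illustrative choices, not the architecture of [Goo+14]:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_generator(z, theta):
    """G_theta: noise z of shape (m, dz) -> samples in data space (m, dx)."""
    W1, b1, W2, b2 = theta
    h = np.tanh(z @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # unconstrained data-space output

def mlp_discriminator(x, delta):
    """D_delta: sample x of shape (m, dx) -> probability it is real, in (0, 1)."""
    W1, b1, w2, b2 = delta
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # sigmoid score

dz, dh, dx = 4, 16, 2
theta = (rng.normal(size=(dz, dh)), np.zeros(dh),
         rng.normal(size=(dh, dx)), np.zeros(dx))
delta = (rng.normal(size=(dx, dh)), np.zeros(dh),
         rng.normal(size=(dh, 1)), np.zeros(1))

z = rng.normal(size=(8, dz))      # latent noise
fake = mlp_generator(z, theta)    # 8 generated samples in R^2
scores = mlp_discriminator(fake, delta)
assert fake.shape == (8, dx) and np.all((scores > 0) & (scores < 1))
```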
Objective of GAN
Hereinafter, we will denote the noise distribution as $Z$, the real data distribution as $X_R$, and the approximated data distribution (by the generator) as $X_G$.
Formally, the training of GAN is formulated as a minimax problem:
\[
\min_\theta \max_\delta \; V_{\delta,\theta}(D, G),
\]
where
\[
V_{\delta,\theta}(D, G) := \mathbb{E}_{X_R}\big[\ln D_\delta(x)\big] + \mathbb{E}_Z\big[\ln\big(1 - D_\delta[G_\theta(z)]\big)\big].
\]
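The value function is straightforward to estimate by Monte Carlo; a small sketch (the Gaussians and the constant discriminator are illustrative — note that $D \equiv 1/2$ gives exactly $\ln\frac12 + \ln\frac12 = -2\ln 2$, a value that reappears later as the optimum):

```python
import numpy as np

def gan_value(D, x_real, x_fake):
    """Monte-Carlo estimate of V(D, G) = E[ln D(x)] + E[ln(1 - D(G(z)))]."""
    return np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(x_fake)))

rng = np.random.default_rng(0)
x_real = rng.normal(0.0, 1.0, size=1000)   # samples from X_R
x_fake = rng.normal(3.0, 1.0, size=1000)   # samples from X_G

# A blind discriminator that always outputs 1/2:
v = gan_value(lambda x: np.full_like(x, 0.5), x_real, x_fake)
print(v)   # ln(1/2) + ln(1/2) = -2 ln 2 ≈ -1.3863
```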
The next slide will present an alternating algorithm for solving this minimax problem.
Algorithm of GAN
Input: A noise prior $Z$ and a data-generating distribution $X_R$;
for $t \leftarrow 1, \ldots, T$ do
    for $k \leftarrow 1, \ldots, K$ do
        Sample noises $\{z_1, \ldots, z_m\}$ from $Z$;
        Sample data $\{x_1, \ldots, x_m\}$ from $X_R$;
        Update the discriminator $D$ by stochastic gradient ascent:
        \[
        \delta \leftarrow \delta + \alpha \nabla_\delta \frac{1}{m} \sum_{i=1}^m \Big[ \ln D_\delta(x_i) + \ln\big(1 - D_\delta[G(z_i)]\big) \Big]
        \]
    end
    Sample noises $\{z_1, \ldots, z_m\}$ from $Z$;
    Update the generator $G$ by stochastic gradient descent:
    \[
    \theta \leftarrow \theta - \alpha \nabla_\theta \frac{1}{m} \sum_{i=1}^m \ln\big(1 - D[G_\theta(z_i)]\big)
    \]
end
Algorithm 1: Minibatch training of generative adversarial nets.
Algorithm of GAN
This algorithm alternates between K steps of optimizing D and 1 step of optimizing G. The discriminator D is trained multiple times in the hope that it stays near its optimal solution.
The number of steps K is a hyper-parameter chosen to prevent prohibitive computation and over-fitting in the inner loop.
The gradient-based updates can be replaced by any standard gradient-based learning rule, e.g. RMSProp, ADAM. (See [Rud16] for details of these methods.)
We will next examine the objective of GAN in detail. Before doing so, however, let us review two distance measures for distributions: Kullback–Leibler and Jensen–Shannon.
Objective of GAN
Let $\mu$ and $\nu$ be probability measures over $X$ with $\mu \ll \nu$. The Kullback–Leibler divergence between $\mu$ and $\nu$ is defined as
\[
\mathrm{KL}(\mu \,\|\, \nu) := \int_X \ln \frac{d\mu}{d\nu} \, d\mu = \int_X \ln \frac{f(x)}{g(x)} \, f(x) \, dx,
\]
where $f$ and $g$ are the respective densities of $\mu$ and $\nu$, assuming that they exist. This divergence is asymmetric, and it can be infinite if the two measures do not share a common support.
A measure closely connected to Kullback–Leibler is the Jensen–Shannon divergence:
\[
\mathrm{JS}(\mu \,\|\, \nu) = \mathrm{KL}\Big(\mu \,\Big\|\, \frac{\mu + \nu}{2}\Big) + \mathrm{KL}\Big(\nu \,\Big\|\, \frac{\mu + \nu}{2}\Big).
\]
Unlike the Kullback–Leibler, this divergence is symmetric and always defined.
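Both divergences are easy to compute for discrete distributions; a small numpy sketch (note that this document's JS convention omits the usual 1/2 factor, so identical distributions give 0 and disjoint ones give $2\ln 2$):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; +inf if supp(p) is not in supp(q)."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """JS(p || q) = KL(p || m) + KL(q || m), m = (p + q)/2 (no 1/2 factor)."""
    m = (p + q) / 2.0
    return kl(p, m) + kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])   # disjoint support

print(kl(p, q))   # inf: KL blows up without a common support
print(js(p, q))   # 2 ln 2 ≈ 1.3863: JS stays finite even for disjoint supports
print(js(p, p))   # 0.0: identical distributions
```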
Objective of GAN
Through a change of variables, $V(D, G)$ can be expanded as
\[
V(D, G) = \int_X \Big( f_{X_R}(x) \ln[D(x)] + f_{X_G}(x) \ln[1 - D(x)] \Big) \, dx.
\]
Taking the derivative of the integrand with respect to $D(x)$, it is not difficult to see that for a fixed $G$, $V(D, G)$ attains its maximum at
\[
D^*(x) = \frac{f_{X_R}(x)}{f_{X_R}(x) + f_{X_G}(x)}.
\]
In particular, if the generator does a perfect job so that $P_{X_G} \stackrel{\text{a.e.}}{=} P_{X_R}$, we will have $D^*(x) = 1/2$. Plugging this back into $V(D, G)$, we get
\[
V^*(D, G) = -2 \ln 2.
\]
Objective of GAN
By holding $D$ at its optimum, we can derive with some algebra that
\[
V(D^*, G) = \mathrm{JS}(P_{X_R} \,\|\, P_{X_G}) - 2 \ln 2.
\]
Thus, the generator updating step in the GAN algorithm can be seen as minimizing the JS divergence between $X_G$ and $X_R$.
This also proves that $V^* = -2 \ln 2$ is a global minimum, since $\mathrm{JS}(\mu \,\|\, \nu)$ is non-negative and zero if and only if $\mu \stackrel{\text{a.e.}}{=} \nu$.
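The identity $V(D^*, G) = \mathrm{JS} - 2\ln 2$ can be sanity-checked numerically on two discrete distributions; a sketch with hypothetical 4-point densities (using this document's 1/2-free JS convention):

```python
import numpy as np

# Two discrete "densities" on a common 4-point grid (illustrative values)
f_r = np.array([0.4, 0.3, 0.2, 0.1])   # real data distribution
f_g = np.array([0.1, 0.2, 0.3, 0.4])   # generator distribution

# Optimal discriminator D*(x) = f_r / (f_r + f_g)
d_star = f_r / (f_r + f_g)

# V(D*, G) = sum f_r ln D* + sum f_g ln(1 - D*)
v_star = np.sum(f_r * np.log(d_star)) + np.sum(f_g * np.log(1 - d_star))

# JS(P_r || P_g) - 2 ln 2, with JS(mu||nu) = KL(mu||m) + KL(nu||m)
m = (f_r + f_g) / 2
js = np.sum(f_r * np.log(f_r / m)) + np.sum(f_g * np.log(f_g / m))

assert np.isclose(v_star, js - 2 * np.log(2))
```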
Performance of GAN
GAN’s success at generating realistic samples, it seems, can largely be attributed to the use of the JS divergence instead of the KL.
To see why, let us first review a fundamental assumption in manifold learning.
Performance of GAN
Known as the manifold hypothesis, this assumption states that high-dimensional data, in plenty of cases of interest, lie in the vicinity of a low-dimensional manifold.
Evidence suggests this assumption is at least approximately correct for many tasks involving images, sounds, or text. ([GBC16])
A heuristic argument for the hypothesis is that if we sample images uniformly at random, then we will almost always observe noise and never encounter an image with discernible features. Therefore, the set of interesting images must be negligible compared to the entire image space.
Performance of GAN
Classical generative models are often fit by maximizing likelihood, which is asymptotically equivalent to minimizing $\mathrm{KL}(P_{X_R} \,\|\, P_{X_G})$.
However, if the manifold hypothesis is true, then the intersection of the supports of $X_R$ and $X_G$ is likely negligible, in which case the KL divergence will become $+\infty$.
The typical remedy is to add a noise term to the model distribution, which forces the supports to overlap. But this in turn degrades the quality of the generated samples.
Performance of GAN
Furthermore, imagine trying to model a multi-modal $P_{X_R}$ with a uni-modal $P_{X_G}$.
Minimizing $\mathrm{KL}(P_{X_R} \,\|\, P_{X_G})$ corresponds to moment matching, which has a tendency to overgeneralize and generate implausible samples.
Minimizing $\mathrm{KL}(P_{X_G} \,\|\, P_{X_R})$, on the other hand, leads to mode-seeking behavior: the optimal $P_{X_G}$ will typically concentrate around the largest mode of $P_{X_R}$ while ignoring the rest.
The JS divergence can be regarded as an interpolation between the two KL divergences. This fact likely alleviates many of the issues caused by the latter. ([Hus15])
The JS divergence is also smoother, and defined even in the case of non-overlapping supports. These properties suggest it is a nicer objective to optimize.
Problems in GAN
Even though the JS divergence has appealing properties and GAN has proven successful in practice, they are far from perfect.
The training of GAN is often difficult and unstable. Some of its most prominent problems include:
• Vanishing gradient;
• Mode collapse;
• Non-convergence.
Problems in GAN: Vanishing Gradient
The JS divergence is not free from the problem caused by the manifold hypothesis. If the supports of $X_R$ and $X_G$ are disjoint or lie on low-dimensional manifolds, then (almost surely)
\[
\mathrm{JS}(P_{X_R} \,\|\, P_{X_G}) = 2 \ln 2,
\]
even if the manifolds lie arbitrarily close. ([AB17])
The loss function of GAN is therefore discontinuous near its optimum, and minimizing it there via gradients is impossible.
Problems in GAN: Vanishing Gradient
In addition, if the manifold hypothesis is true, then (again almost surely) there exists an optimal smooth discriminator $D^*$ that discerns $X_R$ and $X_G$ perfectly. It is not difficult to prove that, as $D \to D^*$,
\[
\Big\| \nabla_\theta \, \mathbb{E}_Z\big[\ln\big(1 - D[G_\theta(z)]\big)\big] \Big\|_2 \to 0.
\]
Simply put, as the discriminator becomes better and better through training, the gradient for the generator becomes smaller and smaller. The model updates are therefore increasingly negligible.
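The saturation is visible even for a one-parameter sigmoid discriminator $D = \sigma(a)$: the gradient of the generator loss $\ln(1 - \sigma(a))$ with respect to the logit $a$ is $-\sigma(a)$, which vanishes exactly when the discriminator confidently rejects fakes ($a \to -\infty$), while the alternative loss $-\ln \sigma(a)$ has gradient $-(1 - \sigma(a))$, which stays bounded away from zero there. A small sketch:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Logits of D on fake samples; very negative a means D is confident they are fake
a = np.array([0.0, -2.0, -5.0, -10.0])

grad_saturating = -sigmoid(a)            # d/da of ln(1 - sigmoid(a))
grad_alternative = -(1.0 - sigmoid(a))   # d/da of -ln sigmoid(a)

print(grad_saturating)    # decays toward 0 as the discriminator improves
print(grad_alternative)   # stays near -1 instead
```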
Problems in GAN: Vanishing Gradient
The original GAN paper noticed this issue and proposed replacing $\mathbb{E}_Z\big[\ln\big(1 - D[G(z)]\big)\big]$ with $-\mathbb{E}_Z\big[\ln D[G(z)]\big]$ in training $G$.
However, it has been shown in [AB17] that, under some strong assumptions, the coordinates of $\mathbb{E}_Z\big[-\nabla_\theta \ln D[G_\theta(z)]\big]$ follow a Cauchy distribution, which has undefined mean and variance, implying instability.
While the assumptions for this result are a bit unrealistic, empirical studies have shown that this alternative gradient indeed becomes increasingly unstable as the discriminator gets better.
Problems in GAN: Mode Collapse
Another serious issue faced by GAN is mode collapse, which means the generator fails to output diverse samples. For example, the true distribution $X_R$ is multi-modal, but $X_G$ concentrates only around a few of the modes, or even a single point.
At mode collapse, the generator can still produce good-quality samples, and the discriminator will not classify them as fake. This prevents the generator from improving further.
Problems in GAN: Mode Collapse
Figure 1: A demonstration of mode collapse from [Met+16]. Columns show a heatmap of the generator distribution after increasing numbers of training steps. The generator rotates through the modes of the data distribution. It never converges to a fixed distribution, and only ever assigns significant probability mass to a single data mode at once.
Problems in GAN: Mode Collapse
As in the previous cases, the $\mathrm{KL}(P_{X_G} \,\|\, P_{X_R})$ component in the GAN objective may be the culprit.
Heuristic techniques, e.g. mini-batch discrimination and historical averaging, have been introduced to deal with the problem. ([Sal+16])
Extended models, such as the unrolled GAN ([Met+16]), are shown to be effective against mode collapse.
Certain non-GAN models, such as Implicit Maximum Likelihood Estimation ([LM18]), also provide insight and potential solutions to the mode-collapsing issue. They may be helpful in directing future studies.
Problems in GAN: Non-convergence
Recall that the algorithm of GAN can be interpreted as finding a Nash equilibrium for a minimax problem through stochastic gradient ascent/descent.
Unfortunately, stochastic gradient methods are not designed for this kind of problem. There is no theoretical guarantee that the algorithm will ever converge.
After all, the training of GAN is a highly heuristic process, and further studies are still needed.
Wasserstein GAN
In an attempt to cope with some of the aforementioned difficulties, the Wasserstein Generative Adversarial Network (WGAN, [ACB17]) was introduced.
As the name implies, it replaces the JS divergence in the GAN objective with the Wasserstein distance.
The Wasserstein distance is deeply connected to the problem of optimal transport, both of which have become of great interest in recent years.
Wasserstein GAN
Let $\mu$ and $\nu$ be probability measures over $X$, and let $\Gamma(\mu, \nu)$ be the set of all measures over $X \times X$ whose respective marginals are $\mu$ and $\nu$. The Wasserstein-$p$ distance between $\mu$ and $\nu$ is defined as
\[
W_p(\mu, \nu) := \left( \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}.
\]
The WGAN utilizes specifically the Wasserstein-1 distance (also known as the earth-mover distance). It can be expressed as
\[
W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu,\nu)} \mathbb{E}_\gamma\big[\|X - Y\|\big],
\]
where $(X, Y) \sim \gamma$.
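In one dimension, $W_1$ between two equal-size empirical distributions reduces to the mean absolute difference of the sorted samples (a standard quantile-coupling fact), which makes a quick numpy-only sanity check possible:

```python
import numpy as np

def w1_empirical(x, y):
    """W_1 between two equal-size 1-D empirical distributions.

    The optimal coupling in 1-D matches the i-th smallest x to the
    i-th smallest y, so W_1 is the mean absolute sorted difference.
    """
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)

# Shifting a distribution by c moves every unit of mass exactly c:
print(w1_empirical(x, x + 3.0))   # ≈ 3.0 for a pure shift
print(w1_empirical(x, x))         # 0.0
```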
Wasserstein GAN
In the context of GAN, the choice of $W_1$ offers several benefits over the JS divergence.
An important property of $W_1$ is that it induces the weak topology. That is, $W_1(\mu_n, \mu) \to 0$ if and only if $\mu_n \xrightarrow{D} \mu$. This is comparatively weaker than JS convergence, which is equivalent to convergence in total variation.
Intuitively, the easier the convergence, the easier it is to have a continuous mapping between $\theta$ and $P_{X_G}$.
Wasserstein GAN
Indeed, if $G_\theta(z)$ is continuous in $\theta$, so is $W_1(P_{X_R}, P_{X_G})$.
Moreover, if $G_\theta(z)$ is locally Lipschitz and satisfies a mild regularity condition, then $W_1(P_{X_R}, P_{X_G})$ is continuous everywhere and differentiable almost everywhere.
The mild condition we mentioned is automatically satisfied by any feed-forward neural network, which is a composition of affine transformations and pointwise nonlinearities that are smooth Lipschitz functions.
As a result, $W_1$ as an objective in GAN is expected to behave much better than the JS divergence.
Wasserstein GAN
Unfortunately, computing the $W_p$ distance is very difficult. It is usually done in the Kantorovich–Rubinstein dual form:
\[
\frac{W_p^p(\mu, \nu)}{p} = \sup_{\psi \in \Psi_p(X)} \int_X \psi \, d\mu + \int_X \psi^{c_p} \, d\nu,
\]
where $\psi^{c_p}$ denotes the $c_p$-conjugate of $\psi$:
\[
\psi^{c_p}(y) := \inf_{x \in X} \; c_p(x, y) - \psi(x), \qquad c_p(x, y) := \frac{\|x - y\|^p}{p},
\]
and $\Psi_p(X)$ is the set of all $c_p$-concave functions, i.e. functions that can be written as the $c_p$-conjugate of some other function.
Wasserstein GAN
Incidentally, the dual form of the $W_1$ distance is quite tame, namely:
\[
W_1(\mu, \nu) = \sup_{\psi \in \Psi_1(X)} \mathbb{E}_\mu[\psi] - \mathbb{E}_\nu[\psi],
\]
where the $c_1$-concave set is in fact the set of all 1-Lipschitz functions:
\[
\Psi_1(X) = \big\{ \psi : X \to \mathbb{R} \;\big|\; |\psi(x) - \psi(y)| \le \|x - y\| \;\; \forall x, y \in X \big\}.
\]
Wasserstein GAN
Optimizing over all 1-Lipschitz functions is no easy task. We can, however, discretize the problem in the following way: suppose $\{\psi_w\}_{w \in \mathcal{W}}$, for some compact $\mathcal{W}$, is a parameterized family of functions that are all $K$-Lipschitz for some $K$; then solving
\[
\max_{w \in \mathcal{W}} \; \mathbb{E}_{X_R}[\psi_w] - \mathbb{E}_{X_G}[\psi_w]
\]
yields $W_1(P_{X_R}, P_{X_G})$ up to a multiplicative constant.
Notably, we have
\[
\nabla_\theta W_1(P_{X_R}, P_{X_G}) = -\mathbb{E}_Z\big[ \nabla_\theta \psi_w\big(G_\theta(z)\big) \big],
\]
which makes gradient methods possible.
The next slide will present an algorithm for fitting WGAN.
Algorithm of WGAN
Input: A noise prior $Z$ and a data-generating distribution $X_R$;
for $t \leftarrow 1, \ldots, T$ do
    for $k \leftarrow 1, \ldots, K$ do
        Sample noises $\{z_1, \ldots, z_m\}$ from $Z$;
        Sample data $\{x_1, \ldots, x_m\}$ from $X_R$;
        Update $\psi_w$ by stochastic gradient ascent:
        \[
        w \leftarrow w + \alpha \nabla_w \frac{1}{m} \sum_{i=1}^m \Big[ \psi_w(x_i) - \psi_w\big(G(z_i)\big) \Big];
        \]
        Clip the components of $w$ to $[-c, c]$;
    end
    Sample noises $\{z_1, \ldots, z_m\}$ from $Z$;
    Update the generator $G$ by stochastic gradient descent on $W_1$:
    \[
    \theta \leftarrow \theta + \alpha \nabla_\theta \frac{1}{m} \sum_{i=1}^m \psi_w\big[G_\theta(z_i)\big]
    \]
end
Algorithm 2: Minibatch training of Wasserstein generative adversarial nets.
Algorithm of WGAN
Akin to GAN, this algorithm alternates between updating $\psi_w$ and optimizing $G$. Note that $\psi$ is no longer a “discriminator”, but a helper in computing $W_1$. (It is in fact learning the Kantorovich potential.)
To ensure that the parameters $w$ lie in a compact space, they are clipped to the hypercube $[-c, c]^d$ after each gradient update.
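The clipping step itself is one line per parameter tensor; a hypothetical numpy sketch (the shapes and the clipping constant c = 0.01 are illustrative):

```python
import numpy as np

def clip_params(params, c=0.01):
    """Project every critic parameter tensor back into the hypercube [-c, c]."""
    return [np.clip(w, -c, c) for w in params]

rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(4, 4)),   # weight matrix
          rng.normal(scale=0.1, size=4)]        # bias vector

# ... a gradient-ascent step on w would go here ...
params = clip_params(params, c=0.01)
assert all(np.all(np.abs(w) <= 0.01) for w in params)
```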
Performance of WGAN
The use of the Wasserstein distance significantly alleviates the problem of vanishing gradients, as demonstrated by the following figure:
Figure 2: Optimal discriminator and critic when learning to differentiate two Gaussians. The GAN discriminator saturates and results in vanishing gradients. The WGAN critic provides very clean gradients on all parts of the space.
Performance of WGAN
This means that the discriminator can be trained to optimality, which in turn helps alleviate the mode-collapsing issue.
Nonetheless, the use of parameter clipping remains an issue.
• Choosing the right clipping parameter c is critical: if c is large, then ψ will take longer to reach its optimum; if c is small, then the problem of vanishing gradients will again appear. ([ACB17])
• Moreover, parameter clipping biases ψ_w towards simple functions. As a result, the optimal ψ* is not in the class of functions that can be generated by the network. ([Gul+17], [PFL17])
Improving WGAN
An alternative to parameter clipping is to use a gradient penalty (WGAN-GP, [Gul+17]). Specifically, the following term is added when training the discriminator of WGAN:
\[
\lambda \, \mathbb{E}_{\hat{X}}\Big[ \big( \|\nabla \psi(\hat{x})\|_2 - 1 \big)^2 \Big].
\]
$\hat{X}$ is defined, implicitly, by sampling uniformly along straight lines between pairs of points from $X_R$ and $X_G$. That is,
\[
\hat{x} = \alpha x + (1 - \alpha) y,
\]
where $\alpha \sim U[0, 1]$, $x \sim X_R$, $y \sim X_G$.
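For a linear critic $\psi(x) = w^\top x$ the gradient $\nabla \psi(\hat{x})$ is simply $w$, so the penalty can be written down exactly; a toy numpy sketch of the interpolation and the two-sided penalty (the value of λ, the linear critic, and the Gaussian batches are illustrative choices, not those of [Gul+17]):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 10.0                    # penalty weight (illustrative)
w = np.array([0.6, 0.8])      # linear critic psi(x) = w @ x, with ||w|| = 1

x_real = rng.normal(0.0, 1.0, size=(64, 2))   # batch from X_R
x_fake = rng.normal(3.0, 1.0, size=(64, 2))   # batch from X_G

# Interpolate uniformly along straight lines between real/fake pairs
alpha = rng.uniform(size=(64, 1))
x_hat = alpha * x_real + (1 - alpha) * x_fake

# For psi(x) = w @ x, grad_x psi(x_hat) = w at every interpolation point
grads = np.broadcast_to(w, x_hat.shape)
penalty = lam * np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2)

print(penalty)   # ≈ 0, since ||w|| = 1 makes psi exactly 1-Lipschitz
```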
Improving WGAN
This choice of penalty is based on the following observation: suppose $\gamma^*$ is the optimal coupling for $W_1$, and $\psi^*$ is the optimal solution for the dual. Let $\hat{x} = \alpha x + (1 - \alpha) y$, where $(x, y) \sim \gamma^*$ and $\alpha \in [0, 1]$. If $\psi^*$ is differentiable, then
\[
P\left( \nabla \psi^*(\hat{x}) = \frac{y - \hat{x}}{\|y - \hat{x}\|} \right) = 1.
\]
This implies, $\gamma^*$-almost surely, that $\|\nabla \psi^*(\hat{x})\| = 1$.
In order for this to work, however, $x$ and $y$ need to be sampled from the optimal coupling, which is not the case in WGAN-GP. In practice, WGAN-GP still suffers from instability.
Improving WGAN
To improve WGAN-GP, it has been proposed (WGAN-LP, [PFL17]) to replace the gradient penalty with its one-sided counterpart:
\[
\lambda \, \mathbb{E}_{\hat{X}}\Big[ \max\big(0, \|\nabla \psi(\hat{x})\|_2 - 1\big)^2 \Big],
\]
where $\hat{x}$ is sampled by concatenating and perturbing both $x \sim X_R$ and $y \sim X_G$ ([Kod+17]). This way, the Lipschitz constraint on $\psi$ is penalized directly.
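The one-sided variant differs only in how sub-unit gradient norms are treated; a minimal comparison sketch:

```python
import numpy as np

def penalty_gp(g_norm):
    """Two-sided WGAN-GP penalty: pushes ||grad|| toward exactly 1."""
    return (g_norm - 1.0) ** 2

def penalty_lp(g_norm):
    """One-sided WGAN-LP penalty: only punishes ||grad|| above 1."""
    return np.maximum(0.0, g_norm - 1.0) ** 2

norms = np.array([0.5, 1.0, 1.5])
print(penalty_gp(norms))   # [0.25 0.   0.25] -- penalizes both directions
print(penalty_lp(norms))   # [0.   0.   0.25] -- only the Lipschitz violation
```

Penalizing only norms above 1 matches the 1-Lipschitz constraint itself, whereas the two-sided penalty also punishes critics that are locally flatter than necessary.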
Compared to WGAN-GP, WGAN-LP seems to be much more stable in training $\psi$. It is also less sensitive to the choice of $\lambda$.
Discussion
We have introduced the basics of GAN and WGAN, and discussed why the latter is more attractive on mathematical grounds.
A large-scale simulation study ([Luc+18]) was conducted to compare the performance of several GAN variants (including WGAN and WGAN-GP). It reported that, given enough computation budget and thorough hyper-parameter tuning, the GAN variants were unable to beat each other consistently.
We may consider, in the future, investigating the implications of this observation.
References I
Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. 2017. arXiv: 1701.04862 [stat.ML].
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. 2017. arXiv: 1701.07875 [stat.ML].
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 2672–2680. URL: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
References II
Ishaan Gulrajani et al. Improved Training of Wasserstein GANs. 2017. arXiv: 1704.00028 [cs.LG].
Ferenc Huszár. How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary? 2015. arXiv: 1511.05101 [stat.ML].
Naveen Kodali et al. On Convergence and Stability of GANs. 2017. arXiv: 1705.07215 [cs.AI].
Ke Li and Jitendra Malik. Implicit Maximum Likelihood Estimation. 2018. arXiv: 1809.09087 [cs.LG].
Mario Lucic et al. “Are GANs Created Equal? A Large-Scale Study”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 700–709. URL: http://papers.nips.cc/paper/7350-are-gans-created-equal-a-large-scale-study.pdf.
References III
Luke Metz et al. Unrolled Generative Adversarial Networks. 2016. arXiv: 1611.02163 [cs.LG].
Henning Petzka, Asja Fischer, and Denis Lukovnicov. On the regularization of Wasserstein GANs. 2017. arXiv: 1709.08894 [stat.ML].
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 2015. arXiv: 1511.06434 [cs.LG].
Sebastian Ruder. An overview of gradient descent optimization algorithms. 2016. arXiv: 1609.04747 [cs.LG].
Tim Salimans et al. Improved Techniques for Training GANs. 2016. arXiv: 1606.03498 [cs.LG].