
Nishant Gurnani

GAN Reading Group

April 14th, 2017


Why are these Papers Important?

- Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN, ...
- Wasserstein GAN is yet another GAN training algorithm, but it is backed by rigorous theory in addition to good performance
- WGAN removes the need to balance generator updates with discriminator updates, removing a key source of training instability in the original GAN
- Empirical results show a correlation between discriminator loss and perceptual quality, providing a rough measure of training progress

Goal - Convince you that WGAN is the “best” GAN


Outline

Introduction

Different Distances

Standard GAN

Wasserstein GAN

Improved Wasserstein GAN


What does it mean to learn a probability distribution?

When learning generative models, we assume the data we have comes from some unknown distribution P_r.

We want to learn a distribution P_θ that approximates P_r, where θ are the parameters of the distribution.

There are two approaches for doing this:

1. Directly learn the probability density function P_θ and then optimize through maximum likelihood estimation

2. Learn a function that transforms an existing distribution Z into P_θ. Here, g_θ is some differentiable function, Z is a common distribution (usually uniform or Gaussian), and P_θ = g_θ(Z)


Maximum Likelihood approach is problematic

Recall that for continuous distributions P and Q the KL divergence is:

KL(P \| Q) = \int_x P(x) \log \frac{P(x)}{Q(x)} \, dx

and given a function P_θ, the MLE objective is

\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)})

In the limit (as m → ∞), samples will appear based on the data distribution P_r, so

\lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)}) = \max_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_\theta(x) \, dx


= \min_{\theta \in \mathbb{R}^d} -\int_x P_r(x) \log P_\theta(x) \, dx

= \min_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_r(x) \, dx - \int_x P_r(x) \log P_\theta(x) \, dx

= \min_{\theta \in \mathbb{R}^d} KL(P_r \| P_\theta)

(These steps flip the sign and add the term \int_x P_r(x) \log P_r(x)\,dx, which does not depend on θ, so the optimal θ is unchanged.)

- Note that if P_θ = 0 at an x where P_r > 0, the KL divergence goes to +∞ (bad for the MLE if P_θ has low-dimensional support)
- The typical remedy is to add a noise term to the model distribution to ensure the distribution is defined everywhere
- This unfortunately introduces some error, and empirically people have needed to add a lot of random noise to make models train
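As a quick numerical illustration of this support problem (a sketch with made-up histograms, not an experiment from the papers), the snippet below shows KL(P_r || P_θ) blowing up when the model puts zero mass where the data has mass, and becoming finite once noise is mixed in:

```python
# Sketch: KL(p || q) on toy discrete histograms. q assigns zero mass to half the
# support of p, so the divergence is infinite; adding a small uniform "noise"
# mass makes it finite at the cost of biasing the objective.
import numpy as np

def kl(p, q, eps=0.0):
    q = (q + eps) / (q + eps).sum()          # optionally smooth q with noise
    mask = p > 0
    with np.errstate(divide="ignore"):       # p/0 -> inf is exactly the point
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.25, 0.25, 0.25, 0.25])       # "data" distribution P_r
q = np.array([0.50, 0.50, 0.00, 0.00])       # model P_theta with too-small support

print(kl(p, q))            # inf: zero model mass where the data has mass
print(kl(p, q, eps=1e-3))  # finite once noise is added
```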


Transforming an existing distribution

Shortcomings of the maximum likelihood approach motivate the second approach: learning a g_θ (a generator) to transform a known distribution Z.

Advantages of this approach:

- Unlike densities, this approach can represent distributions confined to a low-dimensional manifold
- It's very easy to generate samples - given a trained g_θ, simply sample random noise z ∼ Z and evaluate g_θ(z)

VAEs and GANs are well-known examples of this approach.

VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms.

GANs offer much more flexibility, but their training is unstable.
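A minimal sketch of the sampling step (a hypothetical toy generator, not an architecture from the papers) - once g_θ is trained, drawing samples is a single forward pass:

```python
import torch
import torch.nn as nn

# Toy generator g_theta: maps 2-d Gaussian noise Z to points in R^2.
g_theta = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

z = torch.randn(128, 2)    # z ~ Z (standard Gaussian)
x_fake = g_theta(z)        # samples from P_theta = g_theta(Z); no density is ever evaluated
```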


Transforming an existing distribution

To train g_θ (and by extension P_θ), we need a measure of distance between distributions, i.e. d(P_r, P_θ).

Distance Properties

- A distance d is weaker than a distance d′ if every sequence that converges under d′ converges under d
- Given a distance d, we can treat d(P_r, P_θ) as a loss function
- We can minimize d(P_r, P_θ) with respect to θ as long as the mapping θ ↦ P_θ is continuous (true if g_θ is a neural network)

How do we define d?



Distance definitions

Notation
χ - compact metric set (such as the space of images [0, 1]^d)
Σ - set of all Borel subsets of χ
Prob(χ) - space of probability measures defined on χ

Total Variation (TV) distance

\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|

Kullback-Leibler (KL) divergence

KL(P_r \| P_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, d\mu(x)

where both P_r and P_θ are assumed to be absolutely continuous with respect to the same measure µ defined on χ.


Distance Definitions

Jensen-Shannon (JS) Divergence

JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \| P_m) + \tfrac{1}{2} KL(P_\theta \| P_m)

where P_m is the mixture (P_r + P_\theta)/2

Earth-Mover (EM) distance or Wasserstein-1

W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]

where Π(P_r, P_θ) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_θ.
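For intuition, these quantities are easy to evaluate on small discrete distributions (an illustrative sketch, not from the papers; SciPy's wasserstein_distance handles the 1-D Earth-Mover case):

```python
# Sketch: TV, JS and W1 between two toy histograms supported on {0, 1, 2, 3}.
import numpy as np
from scipy.stats import wasserstein_distance

p = np.array([0.1, 0.4, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.4, 0.1])
support = np.arange(4.0)

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

tv = 0.5 * np.abs(p - q).sum()                      # total variation
m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)                # Jensen-Shannon
w1 = wasserstein_distance(support, support, p, q)   # Earth-Mover / Wasserstein-1

print(tv, js, w1)
```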


Understanding the EM distance

Main Idea
Probability distributions are defined by how much mass they put on each point.

Imagine we started with distribution P_r, and wanted to move mass around to change the distribution into P_θ.

Moving mass m by distance d costs m · d effort.

The earth mover distance is the minimal effort we need to spend.

Transport Plan

Each γ ∈ Π is a transport plan; to execute the plan, for all x, y move γ(x, y) mass from x to y.


Understanding the EM distance

What properties does the plan need to satisfy to transform P_r into P_θ?

The amount of mass that leaves x is \int_y \gamma(x, y)\,dy. This must equal P_r(x), the amount of mass originally at x.

The amount of mass that enters y is \int_x \gamma(x, y)\,dx. This must equal P_θ(y), the amount of mass that ends up at y.

This shows why the marginals of γ ∈ Π must be P_r and P_θ. For scoring, the effort spent is

\int_x \int_y \gamma(x, y)\,\|x - y\|\,dy\,dx = \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]

Computing the infimum of this over all valid γ gives the earth mover distance.
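As a concrete (hypothetical) illustration: for small discrete distributions, the optimal transport plan γ can be found by linear programming, with the two marginal constraints above appearing explicitly.

```python
# Sketch: Earth-Mover distance between two discrete distributions on the real
# line, found by minimizing sum_ij gamma[i, j] * |x_i - x_j| subject to the
# marginal constraints on gamma.
import numpy as np
from scipy.optimize import linprog

xs = np.array([0.0, 1.0, 2.0])
p_r = np.array([0.5, 0.5, 0.0])                    # mass originally at each point
p_t = np.array([0.0, 0.5, 0.5])                    # mass we want to end up with

n = len(xs)
cost = np.abs(xs[:, None] - xs[None, :]).ravel()   # ||x - y|| for every pair (i, j)

A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0               # sum_j gamma[i, j] = p_r[i]
for j in range(n):
    A_eq[n + j, j::n] = 1.0                        # sum_i gamma[i, j] = p_t[j]
b_eq = np.concatenate([p_r, p_t])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("W1 =", res.fun)                             # 1.0: shift each half-unit of mass one step right
print(res.x.reshape(n, n))                         # the optimal transport plan gamma
```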


Learning Parallel Lines Example

Let Z ∼ U[0, 1] and let P_0 be the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin. Now let g_θ(z) = (θ, z), with θ a single real parameter.

We'd like our optimization algorithm to learn to move θ to 0. As θ → 0, the distance d(P_0, P_θ) should decrease.


For many common distance functions, this doesn't happen.

\delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}

JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}

KL(P_\theta \| P_0) = KL(P_0 \| P_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}

W(P_0, P_\theta) = |\theta|

Only the EM distance decreases smoothly as θ → 0; the others are constant (or infinite) whenever θ ≠ 0, so they give no usable gradient signal for learning θ.


Theoretical Justification

Theorem (1)

Let P_r be a fixed distribution over χ. Let Z be a random variable over another space Z. Let g : Z × R^d → χ be a function, denoted g_θ(z) with z the first coordinate and θ the second. Let P_θ denote the distribution of g_θ(Z). Then,

1. If g is continuous in θ, so is W(P_r, P_θ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(P_r, P_θ) is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P_r, P_θ) and all the KLs.


Theorem (2)

Let P be a distribution on a compact space χ and (P_n)_{n∈N} be a sequence of distributions on χ. Then, considering all limits as n → ∞,

1. The following statements are equivalent:
   - δ(P_n, P) → 0
   - JS(P_n, P) → 0
2. The following statements are equivalent:
   - W(P_n, P) → 0
   - P_n → P in distribution (convergence in distribution for random variables)
3. KL(P_n \| P) → 0 or KL(P \| P_n) → 0 imply the statements in (1).
4. The statements in (1) imply the statements in (2).


Generative Adversarial Networks

Recall that the GAN training strategy is to define a game between two competing networks.

The generator network G maps a source of noise to the input space.

The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two.

The generator is trained to fool the discriminator.


Formally, we can express the game between the generator G and the discriminator D with the minimax objective:

\min_G \max_D \; \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x}\sim P_g}[\log(1 - D(\tilde{x}))]

where P_r is the data distribution and P_g is the model distribution implicitly defined by \tilde{x} = G(z), z ∼ p(z).
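A minimal sketch of how this objective is optimized in practice by alternating gradient steps (toy fully-connected networks and stand-in data assumed here, not taken from the papers):

```python
# Sketch: one alternating update of the minimax objective on toy 2-d data.
# D outputs a probability in (0, 1); G maps Gaussian noise to the data space.
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

x_real = torch.randn(64, 2)           # stand-in for a minibatch from P_r
z = torch.randn(64, 2)

# Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: descend E[log(1 - D(G(z)))]
g_loss = torch.log(1 - D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```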


Remarks

- If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x
- This is expensive and often leads to vanishing gradients as the discriminator saturates
- In practice, this requirement is relaxed, and the generator and discriminator are updated simultaneously



Kantorovich-Rubinstein Duality

Unfortunately, computing the Wasserstein distance exactly is intractable.

W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]

However, the Kantorovich-Rubinstein duality (Villani 2008) shows that W is equivalent to

W(P_r, P_\theta) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)]

where the supremum is taken over all 1-Lipschitz functions f.

Calculating this is still intractable, but now it's easier to approximate.


Wasserstein GAN Approximation

Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K · W(P_r, P_θ) instead.

Suppose we have a parametrized function family {f_w}_{w∈W}, where w are the weights and W is the set of all possible weights.

Furthermore, suppose these functions are all K-Lipschitz for some K. Then we have

\max_{w \in \mathcal{W}} \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_\theta}[f_w(x)] \;\leq\; \sup_{\|f\|_L \leq K} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)] \;=\; K \cdot W(P_r, P_\theta)

If {f_w}_{w∈W} contains the function achieving the supremum among K-Lipschitz functions, this gives the distance exactly (up to the factor K).

In practice this won't be true!


Wasserstein GAN Algorithm

Looping all this back to generative models, we'd like to train P_θ = g_θ(Z) to match P_r.

Intuitively, given a fixed g_θ, we can compute the optimal f_w for the Wasserstein distance.

We can then backpropagate through W(P_r, g_θ(Z)) to get the gradient for θ:

\nabla_\theta W(P_r, P_\theta) = \nabla_\theta \big(\mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{z\sim Z}[f_w(g_\theta(z))]\big) = -\mathbb{E}_{z\sim Z}[\nabla_\theta f_w(g_\theta(z))]


The training process now has three steps:

1. For a fixed θ, compute an approximation of W(P_r, P_θ) by training f_w to convergence
2. Once we have the optimal f_w, compute the θ gradient -\mathbb{E}_{z\sim Z}[\nabla_\theta f_w(g_\theta(z))] by sampling several z ∼ Z
3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family {f_w}_{w∈W} is K-Lipschitz.

To guarantee this is true, the authors use weight clamping: the weights w are constrained to lie within [-c, c] by clipping w after every update.
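A compact sketch of this training loop (toy fully-connected networks and a made-up 2-d data distribution; n_critic = 5, clip value c = 0.01 and RMSProp follow the defaults suggested in the WGAN paper):

```python
# Sketch: WGAN training with weight clipping on a toy 2-d problem.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # f_w, unconstrained output
gen = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))     # g_theta
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
n_critic, clip_c, batch = 5, 0.01, 64

def sample_real(n):                        # stand-in for minibatches from P_r
    return torch.randn(n, 2) + torch.tensor([4.0, 4.0])

for step in range(1000):
    # Step 1: train f_w (approximately) to convergence for the current theta
    for _ in range(n_critic):
        x_real, z = sample_real(batch), torch.randn(batch, 2)
        w_est = critic(x_real).mean() - critic(gen(z).detach()).mean()  # ~ W(P_r, P_theta)
        opt_c.zero_grad(); (-w_est).backward(); opt_c.step()            # gradient ascent on w_est
        for p in critic.parameters():                                   # weight clamping -> K-Lipschitz
            p.data.clamp_(-clip_c, clip_c)

    # Steps 2-3: estimate -E_z[grad_theta f_w(g_theta(z))] and update theta
    z = torch.randn(batch, 2)
    g_loss = -critic(gen(z)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```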

80 / 107

Page 81: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

The training process now has three steps:

1. For a fixed θ, compute an approximation of W (Pr ,Pθ) bytraining fw to convergence

2. Once we have the optimal fw , compute the θ gradient−Ez∼Z [∇θfw (gθ(z))] by sampling several z ∼ Z

3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family{fw}w∈W is K -Lipschitz.

To guarantee this is true, the authors use weight clamping. Theweights w are constrained to lie within [−c , c], by clipping w afterevery update to w .

81 / 107

Page 82: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

The training process now has three steps:

1. For a fixed θ, compute an approximation of W (Pr ,Pθ) bytraining fw to convergence

2. Once we have the optimal fw , compute the θ gradient−Ez∼Z [∇θfw (gθ(z))] by sampling several z ∼ Z

3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family{fw}w∈W is K -Lipschitz.

To guarantee this is true, the authors use weight clamping. Theweights w are constrained to lie within [−c , c], by clipping w afterevery update to w .

82 / 107

Page 83: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

The training process now has three steps:

1. For a fixed θ, compute an approximation of W (Pr ,Pθ) bytraining fw to convergence

2. Once we have the optimal fw , compute the θ gradient−Ez∼Z [∇θfw (gθ(z))] by sampling several z ∼ Z

3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family{fw}w∈W is K -Lipschitz.

To guarantee this is true, the authors use weight clamping. Theweights w are constrained to lie within [−c , c], by clipping w afterevery update to w .

83 / 107

Page 84: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

The training process now has three steps:

1. For a fixed θ, compute an approximation of W (Pr ,Pθ) bytraining fw to convergence

2. Once we have the optimal fw , compute the θ gradient−Ez∼Z [∇θfw (gθ(z))] by sampling several z ∼ Z

3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family{fw}w∈W is K -Lipschitz.

To guarantee this is true, the authors use weight clamping. Theweights w are constrained to lie within [−c , c], by clipping w afterevery update to w .

84 / 107

Page 85: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

The training process now has three steps:

1. For a fixed θ, compute an approximation of W (Pr ,Pθ) bytraining fw to convergence

2. Once we have the optimal fw , compute the θ gradient−Ez∼Z [∇θfw (gθ(z))] by sampling several z ∼ Z

3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family{fw}w∈W is K -Lipschitz.

To guarantee this is true, the authors use weight clamping. Theweights w are constrained to lie within [−c , c], by clipping w afterevery update to w .

85 / 107

Page 86: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks

Wasserstein GAN Algorithm

[Figure: pseudocode of the full WGAN training algorithm; the image did not survive extraction.]
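To make the three steps concrete, here is a minimal Python/PyTorch sketch of the loop. The toy Critic/Generator architectures, the synthetic data source, and the loop length are assumptions; the hyperparameters (RMSProp with lr = 5e-5, clip value c = 0.01, n_critic = 5) mirror the defaults reported in the WGAN paper. This is an illustration of the procedure above, not the authors' reference implementation.

```python
# Sketch of WGAN training with weight clipping.  Architectures, sample_real(),
# and the loop length are placeholder assumptions; only the three-step
# structure follows the slides.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
c, n_critic, batch_size = 0.01, 5, 64

def sample_real(n):
    # Stand-in for samples from P_r; replace with a real data loader.
    return torch.randn(n, data_dim) + 3.0

for step in range(1000):
    # Step 1: for fixed theta, train f_w (several inner iterations stand in
    # for "to convergence") to approximate W(P_r, P_theta).
    for _ in range(n_critic):
        real = sample_real(batch_size)
        fake = generator(torch.randn(batch_size, latent_dim)).detach()
        # Maximise E[f_w(real)] - E[f_w(fake)]  <=>  minimise its negation.
        loss_c = critic(fake).mean() - critic(real).mean()
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Weight clamping keeps {f_w} inside a K-Lipschitz family.
        for p in critic.parameters():
            p.data.clamp_(-c, c)

    # Steps 2-3: generator gradient -E_z[grad_theta f_w(g_theta(z))], then update theta.
    loss_g = -critic(generator(torch.randn(batch_size, latent_dim))).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```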


Comparison with Standard GANs

- In GANs, the discriminator maximizes

  $$\frac{1}{m}\sum_{i=1}^{m} \log D\big(x^{(i)}\big) \;+\; \frac{1}{m}\sum_{i=1}^{m} \log\Big(1 - D\big(g_\theta(z^{(i)})\big)\Big)$$

  where we constrain $D(x)$ to always be a probability $p \in (0, 1)$

- In WGANs, nothing requires $f_w$ to output a probability, hence it is referred to as a critic instead of a discriminator (the code sketch after this list contrasts the two objectives)

- Although GANs are formulated as a min-max problem, in practice we never train $D$ to convergence

- Consequently, we are updating $G$ against an objective that only loosely approximates the JS divergence

- In contrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train $f_w$ to convergence before each generator update, to get as accurate an estimate of $W(P_r, P_\theta)$ as possible
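The following short Python/PyTorch sketch contrasts the two objectives side by side; the random logits are placeholders standing in for the networks' raw outputs on a real and a generated batch.

```python
# Side-by-side sketch of the two objectives (written as losses to minimise).
# d_real_logits / d_fake_logits are placeholder raw outputs on real / fake batches.
import torch

d_real_logits = torch.randn(64, 1)
d_fake_logits = torch.randn(64, 1)

# Standard GAN discriminator: outputs squashed to probabilities in (0, 1),
# maximise  mean(log D(x)) + mean(log(1 - D(g(z)))).
gan_d_loss = -(torch.log(torch.sigmoid(d_real_logits)).mean()
               + torch.log(1 - torch.sigmoid(d_fake_logits)).mean())

# WGAN critic: no sigmoid, unconstrained real-valued scores,
# maximise  mean(f_w(x)) - mean(f_w(g(z))).
wgan_critic_loss = -(d_real_logits.mean() - d_fake_logits.mean())
```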

Introduction

Different Distances

Standard GAN

Wasserstein GAN

Improved Wasserstein GAN


Properties of optimal WGAN critic

An open question is how to effectively enforce the Lipschitz constraint on the critic.

Previously (Arjovsky et al. 2017) we saw that one can clip the weights of the critic to lie within the compact space $[-c, c]$.

To understand why weight clipping is problematic, we first need to understand the properties of the optimal WGAN critic.

Properties of optimal WGAN critic

If the optimal critic under the Kantorovich-Rubinstein dual, $D^*$, is differentiable, and $x$ is a point from our generator distribution $P_\theta$, then there is a point $y$ from the true distribution $P_r$ such that, at every point $x_t = (1-t)x + ty$ on the straight line between $x$ and $y$, the gradient of $D^*$ points from $x_t$ towards $y$ with unit norm.

In other words, $\nabla D^*(x_t) = \dfrac{y - x_t}{\lVert y - x_t \rVert}$.

This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under $P_r$ and $P_\theta$.
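A tiny sanity check of this property (an illustrative setup chosen for this summary, not taken from the paper): if $P_r$ is a point mass at $y$ and $P_\theta$ a point mass at $x$, then $D^*(v) = -\lVert v - y \rVert$ attains the dual optimum, and autograd confirms that its gradient along the segment between $x$ and $y$ is the unit vector pointing towards $y$.

```python
# Toy verification of the unit-norm gradient property.  The point-mass setup
# and the choice D*(v) = -||v - y|| are illustrative assumptions: the gradient
# at any x_t on the segment should equal (y - x_t) / ||y - x_t||.
import torch

x = torch.tensor([3.0, -1.0])   # "generated" point
y = torch.tensor([0.5, 2.0])    # "real" point

def critic(v):
    return -torch.norm(v - y)

for t in [0.1, 0.5, 0.9]:
    x_t = ((1 - t) * x + t * y).requires_grad_(True)
    grad, = torch.autograd.grad(critic(x_t), x_t)
    direction = (y - x_t) / torch.norm(y - x_t)
    # Both checks pass: the gradient has norm 1 and points from x_t towards y.
    print(torch.norm(grad).item(), torch.allclose(grad, direction, atol=1e-5))
```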

Gradient penalty

We now consider an alternative way of enforcing the Lipschitz constraint, through the training objective itself.

A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.

This suggests directly constraining the gradient norm of the critic's output with respect to its input.

Enforcing a soft version of this constraint, we get:
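(The objective shown on this slide did not survive extraction; the following reproduces the penalized critic loss from the Improved WGAN paper, which is what this slide leads up to.)

$$L = \mathbb{E}_{\tilde{x} \sim P_\theta}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]$$

where $\hat{x}$ is sampled uniformly along straight lines between pairs of points drawn from $P_r$ and $P_\theta$, and $\lambda$ is the penalty coefficient (set to 10 in the paper).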

WGAN with gradient penalty

[Figure: the WGAN-GP training algorithm; the image did not survive extraction.]
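For concreteness, here is a minimal Python/PyTorch sketch of the penalty term, meant to pair with the earlier training-loop sketch. The function name and shapes are placeholders; $\lambda = 10$ mirrors the value reported in the Improved WGAN paper, and `critic`, `real`, and `fake` are assumed to be defined as before. This is a sketch, not the authors' code.

```python
# Sketch of the WGAN-GP gradient penalty.  lambda_gp = 10 mirrors the paper's
# reported value; names and shapes are assumptions consistent with the earlier
# training-loop sketch.
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Sample x_hat uniformly along straight lines between real and generated points.
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)

    # Gradient of the critic's output with respect to its input x_hat.
    grads, = torch.autograd.grad(
        outputs=critic(x_hat).sum(), inputs=x_hat, create_graph=True
    )

    # Two-sided penalty: push each sample's gradient norm towards 1.
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Usage inside the critic update (replaces weight clipping entirely):
#   loss_c = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```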