
Emanuele Sansone

Generative Adversarial Networks (GANs)

29 March 2017
Website: emsansone.github.io

Goals

Stimulate discussion

1. Why not use this model?
2. What don't you like about this model?
3. Why not improve it?
4. …

Sketch of GANs

Density estimation in high-dimensional data (manifold assumption)

Ian Goodfellow, NIPS 2016 tutorial on GANs

Yoshua Bengio, MLSS 2015 Austin TX

Sketch of GANs

[Diagram: latent noise $z \sim p_z$ is mapped by the generator $g_\theta(\cdot)$ to samples with density $p_g$; the discriminator $D_\phi(\cdot)$ receives both training data (density $p_x$) and generated samples and outputs a value in $[0, 1]$.]

"Generative Adversarial Networks" NIPS 2014, Ian Goodfellow

Sketch of GANs

GENERATIVE

1. Fast sample generation - discriminative function (no Markov chains)
2. Implicit definition of density families

Sketch of GANs

ADVERSARIAL

Game between two adversaries (no likelihood maximization)

SEE LATER

Sketch of GANs

NETWORKS

Exploitation of deep neural architectures (high capacity and efficient training…)

"Learning Deep Architectures for AI" Yoshua Bengio 2009

Application - Image synthesis

"Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" arXiv 2015, Alec Radford, Luke Metz, Soumith Chintala

Application - Video synthesis

“Generating Videos with Scene Dynamics” NIPS 2016 Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

[Hallucinated videos: Beach, Golf, Train, Baby]

[Conditional generation of videos: Input / Output pairs]

Application - Representation Learning 1

“InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets” NIPS 2016 Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

Application - Representation Learning 2

“InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets” NIPS 2016 Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

List of recent papers on Generative Adversarial Networks:

Papers about GANs 2014-2016

- Generative Adversarial Networks, NIPS 2014
- Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks (UNSUPERVISED), NIPS 2015
- DRAW: A Recurrent Neural Network for Image Generation (UNSUPERVISED), ICML 2015
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv 2015
- InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, NIPS 2016
- Towards Principled Unsupervised Learning, ICLR 2016 (workshop)
- Improved Techniques for Training GANs, NIPS 2016
- Unsupervised and Semi-Supervised Learning with Categorical Generative Adversarial Networks, ICLR 2016
- Generating Videos with Scene Dynamics, NIPS 2016
- NIPS 2016 Tutorial: Generative Adversarial Networks (PAPER + VIDEO), NIPS 2016

Papers about GANs accepted to ICLR 2017

- Towards Principled Methods for Training Generative Adversarial Networks (FIRST THEORETICAL PAPER), ICLR 2017
- Adversarially Learned Inference (SoA in SEMI-SUPERVISED LEARNING), ICLR 2017
- Learning to Generate Samples from Noise Through Infusion Training, ICLR 2017
- Improving Generative Adversarial Networks with Denoising Feature Matching, ICLR 2017
- LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation, ICLR 2017
- Mode Regularized Generative Adversarial Networks, ICLR 2017
- Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy, ICLR 2017
- Calibrating Energy-Based Generative Adversarial Networks, ICLR 2017
- Unrolled Generative Adversarial Networks, ICLR 2017
- Generative Multi-Adversarial Networks, ICLR 2017
- Energy-Based Generative Adversarial Networks, ICLR 2017

Probably hot topic?


Problem formulation

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_x}\big\{\log D_\phi(x)\big\} + \mathbb{E}_{z \sim p_z}\big\{\log\big(1 - D_\phi(g_\theta(z))\big)\big\}$$

The first expectation is taken over the training data ($x \sim p_x$), the second over the generated data ($g_\theta(z)$ with $z \sim p_z$, i.e. samples from $p_g$).
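To make the objective concrete, here is a minimal sketch (assuming PyTorch; the slides do not prescribe any framework) of the minibatch estimate of the value above, with `d` and `g` standing for $D_\phi$ and $g_\theta$. The discriminator ascends this value in $\phi$ while the generator descends it in $\theta$.

```python
# Minibatch estimate of E_{x~p_x}[log D(x)] + E_{z~p_z}[log(1 - D(g(z)))].
# A sketch only: `d` maps samples to (0, 1), `g` maps latent codes z to samples.
import torch

def gan_value(d, g, x_real, z):
    p_real = d(x_real).clamp(1e-6, 1 - 1e-6)      # D_phi(x), clamped for numerical safety
    p_fake = d(g(z)).clamp(1e-6, 1 - 1e-6)        # D_phi(g_theta(z))
    return torch.log(p_real).mean() + torch.log(1 - p_fake).mean()
```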

Optimal Discriminator (Proof sketch)

Rewriting the expectation over $z$ as an expectation over generated samples $x \sim p_g$:

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_x}\big\{\log D_\phi(x)\big\} + \mathbb{E}_{x \sim p_g}\big\{\log\big(1 - D_\phi(x)\big)\big\}$$

$$= \int p_x(x) \log D_\phi(x)\,dx + \int p_g(x) \log\big(1 - D_\phi(x)\big)\,dx
= \int \Big\{ p_x(x) \log D_\phi(x) + p_g(x) \log\big(1 - D_\phi(x)\big) \Big\}\,dx$$

Maximizing the integrand pointwise amounts to maximizing $a \log(y) + b \log(1 - y)$ in $y$:

$$\forall (a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}: \quad \frac{\partial\,[a \log(y) + b \log(1 - y)]}{\partial y} = 0 \iff y^* = \frac{a}{a + b}$$

Therefore

$$D^*_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)}$$
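As a quick sanity check (an illustration I added, not part of the original slides), one can verify numerically that $y^* = a/(a+b)$ maximizes $a\log(y) + b\log(1-y)$:

```python
# Numerical check that y* = a / (a + b) maximizes a*log(y) + b*log(1 - y) for a, b > 0.
import numpy as np

a, b = 0.7, 0.3                                   # arbitrary positive "densities" at a point x
y = np.linspace(1e-4, 1 - 1e-4, 100000)
f = a * np.log(y) + b * np.log(1 - y)
print(y[np.argmax(f)], a / (a + b))               # both print approximately 0.7
```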

Optimal Generator (Proof sketch)

Plugging the optimal discriminator $D^*_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)}$ into

$$\int p_x(x) \log D_\phi(x)\,dx + \int p_g(x) \log\big(1 - D_\phi(x)\big)\,dx$$

gives

$$\int p_x(x) \log\!\left(\frac{p_x(x)}{p_x(x) + p_g(x)}\right)dx + \int p_g(x) \log\!\left(1 - \frac{p_x(x)}{p_x(x) + p_g(x)}\right)dx$$

$$= \int p_x(x) \log\!\left(\frac{p_x(x)}{p_x(x) + p_g(x)}\right)dx + \int p_g(x) \log\!\left(\frac{p_g(x)}{p_x(x) + p_g(x)}\right)dx$$

$$= \int p_x(x) \log\!\left(\frac{p_x(x)}{\tfrac{p_x(x) + p_g(x)}{2}}\right)dx + \int p_g(x) \log\!\left(\frac{p_g(x)}{\tfrac{p_x(x) + p_g(x)}{2}}\right)dx - 2\log(2)$$

$$= 2\,\mathrm{JSD}(p_x, p_g) - 2\log(2)$$

The minimum value is $-2\log(2)$, attained when $\mathrm{JSD}(p_x, p_g) = 0 \iff p_g(x) = p_x(x), \;\forall x$.
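A small numeric illustration (not in the original slides) of the last identity: for two discrete distributions, the objective evaluated at $D^*_\phi$ equals $2\,\mathrm{JSD}(p_x, p_g) - 2\log 2$ and reaches $-2\log 2$ exactly when $p_g = p_x$.

```python
# Check that the objective under D* equals 2*JSD(p_x, p_g) - 2*log(2), minimized at p_g = p_x.
import numpy as np

def value_under_optimal_d(px, pg):
    m = px + pg
    return np.sum(px * np.log(px / m)) + np.sum(pg * np.log(pg / m))

def jsd(px, pg):
    m = 0.5 * (px + pg)
    return 0.5 * np.sum(px * np.log(px / m)) + 0.5 * np.sum(pg * np.log(pg / m))

px = np.array([0.2, 0.5, 0.3])
for pg in (np.array([0.6, 0.1, 0.3]), px):
    print(value_under_optimal_d(px, pg), 2 * jsd(px, pg) - 2 * np.log(2))
# The two columns agree; with pg = px both equal -2*log(2), about -1.386.
```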

In practice

[GAN diagram as before: $z \sim p_z$, generator $g_\theta(\cdot)$ producing $p_g$, discriminator $D_\phi(\cdot)$ comparing against the training data $p_x$]

When training succeeds and $p_g = p_x$, the discriminator can no longer distinguish the two sources:

$$D_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)} = \frac{1}{2}$$

In practice

"Generative Adversarial Networks" NIPS 2014, Ian Goodfellow

DON'T WORRY, THERE IS A DEMO

Why is it better than traditional generative approaches?

Traditional generative approaches

$$\ln p(\mathrm{DATA}) = \ln\!\int p(\mathrm{DATA}, \theta)\,d\theta
= \ln\!\int q(\theta)\,\frac{p(\mathrm{DATA}, \theta)}{q(\theta)}\,d\theta
\ge \int q(\theta) \ln\frac{p(\mathrm{DATA}, \theta)}{q(\theta)}\,d\theta
\doteq -\mathrm{KL}\big(q(\theta), p(\theta, \mathrm{DATA})\big)$$

The maximum of this lower bound is achieved for $q(\theta) = p(\theta \,|\, \mathrm{DATA})$. Regarding $q(\theta)$ as $p_g$ and $p(\theta \,|\, \mathrm{DATA})$ as $p_x$, maximizing the bound amounts to

$$\min \mathrm{KL}(p_g, p_x)$$

Traditional generative approaches vs. GANs

- Optimization: traditional generative approaches solve $\min \mathrm{KL}(p_g, p_x)$; GANs solve $\min \mathrm{JSD}(p_g, p_x)$.
- Where $p_x > 0$ and $p_g \to 0$ (mode dropping): KL cost 0 (low cost for mode dropping); JSD cost $\log(2)$.
- Where $p_g > 0$ and $p_x \to 0$ (fake data): KL cost infinity (high cost for fake data); JSD cost $\log(2)$.
- Minimum: 0 for both.
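A toy numeric illustration (not from the slides) of this asymmetry, comparing the two divergences when the generator drops a mode versus when it places mass where the data has none:

```python
# Compare KL(p_g, p_x) and JSD(p_g, p_x) on discrete distributions over 3 bins.
import numpy as np

def kl(p, q):
    mask = p > 0                                   # terms with p = 0 contribute 0
    if np.any(q[mask] == 0):                       # p puts mass where q has none
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

px = np.array([0.5, 0.5, 0.0])                     # data lives on bins 0 and 1
pg_mode_drop = np.array([1.0, 0.0, 0.0])           # generator drops the second mode
pg_fake = np.array([0.5, 0.0, 0.5])                # generator puts mass on an empty bin

print(kl(pg_mode_drop, px), jsd(pg_mode_drop, px)) # KL stays small, JSD bounded by log(2)
print(kl(pg_fake, px), jsd(pg_fake, px))           # KL is infinite, JSD still bounded by log(2)
```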

Is there any issue?

Problem of vanishing gradients

"Towards Principled Methods for Training Generative Adversarial Networks" ICLR 2017, Martin Arjovsky, Leon Bottou

Theorem 2.4: if the discriminator is close to optimality, namely

$$D_\phi(x) \simeq \frac{p_x(x)}{p_x(x) + p_g(x)}$$

in other words

$$D_\phi(x) \simeq 1, \;\forall x \in \mathrm{supp}(p_x) \qquad\text{and}\qquad D_\phi(x) \simeq 0, \;\forall x \in \mathrm{supp}(p_g)$$

(which is a very common case, cf. Theorems 2.1-2.2), and the Jacobian of the generator is bounded by some scalar $M$, then

$$\Big\|\nabla_\theta\, \mathbb{E}_{z \sim p_z}\big\{\log\big(1 - D_\phi(g_\theta(z))\big)\big\}\Big\|_2 \le \frac{M\epsilon}{1 - \epsilon}$$

This happens when the discriminator is "too strong", for example when it is updated more frequently than the generator.
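The following toy experiment is a sketch I added (assuming PyTorch; nothing like it appears in the slides) to illustrate the theorem empirically: the longer the discriminator is trained on its own, the smaller the norm of the generator gradient of $\mathbb{E}_z\{\log(1 - D_\phi(g_\theta(z)))\}$ tends to become. The trend, not the exact numbers, is the point.

```python
# Toy illustration of Theorem 2.4: as D approaches optimality on data far from the
# generator's samples, the generator gradient of E_z[log(1 - D(g(z)))] shrinks.
import torch

torch.manual_seed(0)

def generator_grad_norm(d_steps):
    g = torch.nn.Linear(1, 1)                              # toy generator g_theta(z)
    d = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1), torch.nn.Sigmoid())
    opt_d = torch.optim.Adam(d.parameters(), lr=1e-2)
    for _ in range(d_steps):                               # train only the discriminator
        x = 4.0 + 0.5 * torch.randn(256, 1)                # "real" data, far from g's samples
        fake = g(torch.randn(256, 1)).detach()
        p_real = d(x).clamp(1e-6, 1 - 1e-6)
        p_fake = d(fake).clamp(1e-6, 1 - 1e-6)
        loss_d = -(torch.log(p_real) + torch.log(1 - p_fake)).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    z = torch.randn(4096, 1)
    loss_g = torch.log((1 - d(g(z))).clamp(min=1e-6)).mean()
    grads = torch.autograd.grad(loss_g, list(g.parameters()))
    return torch.cat([p.reshape(-1) for p in grads]).norm().item()

for steps in (1, 10, 100, 1000):
    print(steps, generator_grad_norm(steps))               # norm decreases as D gets stronger
```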

Problem of vanishing gradients (Proof)

$$\big\|\nabla_\theta\, \mathbb{E}_{z \sim p_z}\{\log(1 - D_\phi(g_\theta(z)))\}\big\|_2^2
= \big\|\mathbb{E}_{z \sim p_z}\{\nabla_\theta \log(1 - D_\phi(g_\theta(z)))\}\big\|_2^2
= \Big\|\mathbb{E}_{z \sim p_z}\Big\{-\frac{\nabla_\theta D_\phi(g_\theta(z))}{1 - D_\phi(g_\theta(z))}\Big\}\Big\|_2^2$$

$$\le \mathbb{E}_{z \sim p_z}\Big\{\frac{\|\nabla_\theta D_\phi(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big\}
= \mathbb{E}_{z \sim p_z}\Big\{\frac{\|\nabla_x D_\phi(g_\theta(z))\, J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big\}
\le \mathbb{E}_{z \sim p_z}\Big\{\frac{\|\nabla_x D_\phi(g_\theta(z))\|_2^2\, \|J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big\}$$

(the first inequality is Jensen's inequality, the last one Cauchy-Schwarz; $J$ is the Jacobian of the generator).

Since the discriminator is close to optimality, namely

$$\|D_\phi - D^*_\phi\| \doteq \sup_{x \in \mathcal{X}}\Big\{|D_\phi - D^*_\phi| + \|\nabla_x D_\phi - \nabla_x D^*_\phi\|_2\Big\} < \epsilon$$

(values and gradients are both similar), and since $\|\nabla_x D_\phi - \nabla_x D^*_\phi\|_2^2 \ge \|\nabla_x D_\phi\|_2^2 - \|\nabla_x D^*_\phi\|_2^2$, we get

$$\|\nabla_x D_\phi\|_2^2 < \|\nabla_x D^*_\phi\|_2^2 + \epsilon^2.$$

Similarly, $|D_\phi - D^*_\phi| < \epsilon$ gives $|1 - D^*_\phi - (1 - D_\phi)| < \epsilon$, and since $|1 - D^*_\phi| - |1 - D_\phi| \le |1 - D^*_\phi - (1 - D_\phi)|$, we get $|1 - D_\phi| > |1 - D^*_\phi| - \epsilon$.

Plugging both bounds into the last expectation:

$$\mathbb{E}_{z \sim p_z}\Big\{\frac{\|\nabla_x D_\phi(g_\theta(z))\|_2^2\, \|J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big\}
< \mathbb{E}_{z \sim p_z}\Big\{\frac{\big(\|\nabla_x D^*_\phi(g_\theta(z))\|_2^2 + \epsilon^2\big)\, \|J(g_\theta(z))\|_2^2}{\big(|1 - D^*_\phi(g_\theta(z))| - \epsilon\big)^2}\Big\}$$

At optimality $\nabla_x D^*_\phi(x) = 0$ and $D^*_\phi(x) = 0$ for all $x \in \mathrm{supp}(p_g)$, therefore

$$\big\|\nabla_\theta\, \mathbb{E}_{z \sim p_z}\{\log(1 - D_\phi(g_\theta(z)))\}\big\|_2^2
< \mathbb{E}_{z \sim p_z}\Big\{\frac{\epsilon^2\, \|J(g_\theta(z))\|_2^2}{(1 - \epsilon)^2}\Big\}
\le \frac{M^2 \epsilon^2}{(1 - \epsilon)^2} \qquad \text{QED}$$

How do you solve it?

Recent Solution

"Towards Principled Methods for Training Generative Adversarial Networks" ICLR 2017, Martin Arjovsky, Leon Bottou

Add zero-mean noise $\epsilon$ to both inputs of the discriminator: it now sees $g_\theta(z) + \epsilon$ instead of $g_\theta(z)$ and $x + \epsilon$ instead of $x$.
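A minimal sketch of this fix (assuming PyTorch; the parameter name `sigma` is my own choice): both real and generated samples are perturbed with noise drawn from the same zero-mean Gaussian before being shown to the discriminator.

```python
# Standard GAN discriminator loss, but evaluated on noised inputs x + eps and g(z) + eps.
import torch

def noisy_discriminator_loss(d, g, x_real, z, sigma=0.1):
    x_fake = g(z).detach()
    eps_real = sigma * torch.randn_like(x_real)    # eps ~ N(0, sigma^2 I)
    eps_fake = sigma * torch.randn_like(x_fake)
    p_real = d(x_real + eps_real).clamp(1e-6, 1 - 1e-6)
    p_fake = d(x_fake + eps_fake).clamp(1e-6, 1 - 1e-6)
    return -(torch.log(p_real) + torch.log(1 - p_fake)).mean()
```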

Recent Solution

Theorem 3.2: with noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the optimal discriminator becomes

$$D^*_\phi(x) = \frac{p_{x+\epsilon}(x)}{p_{x+\epsilon}(x) + p_{g+\epsilon}(x)}$$

and the generator gradient becomes

$$\mathbb{E}_{z \sim p_z}\big\{\nabla_\theta \log\big(1 - D^*_\phi(g_\theta(z))\big)\big\}
= \mathbb{E}_{z \sim p_z}\Big\{ a(z)\!\int\! p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_x(y)\,dy
- b(z)\!\int\! p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_g(y)\,dy \Big\}$$

where $b(z) = a(z)\,\dfrac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}$.

"Towards Principled Methods for Training Generative Adversarial Networks" ICLR 2017, Martin Arjovsky, Leon Bottou

Recent Solution (Proof)

Define

$$a(z) \doteq \frac{1}{2\sigma^2}\,\frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}, \qquad
b(z) \doteq \frac{1}{2\sigma^2}\,\frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\,\frac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}.$$

Starting from the optimal discriminator formula, which is assumed fixed and therefore not dependent on $\theta$ (strictly, this step should be derived):

$$\mathbb{E}_{z \sim p_z}\big\{\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big\}
= \mathbb{E}_{z \sim p_z}\Big\{\nabla_\theta \log \frac{p_{g+\epsilon}(g_\theta(z))}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{\nabla_\theta \log p_{g+\epsilon}(g_\theta(z)) - \nabla_\theta \log\big(p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))\big)\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{\frac{\nabla_\theta\, p_{g+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))} - \frac{\nabla_\theta\, p_{x+\epsilon}(g_\theta(z)) + \nabla_\theta\, p_{g+\epsilon}(g_\theta(z))}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{\nabla_\theta\{-p_{x+\epsilon}(g_\theta(z))\}\,\frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}
- \frac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}\,\frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\,\nabla_\theta\{-p_{g+\epsilon}(g_\theta(z))\}\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{2\sigma^2 a(z)\,\nabla_\theta\{-p_{x+\epsilon}(g_\theta(z))\} - 2\sigma^2 b(z)\,\nabla_\theta\{-p_{g+\epsilon}(g_\theta(z))\}\Big\}$$

Recall that adding two independent random variables produces a random variable whose density is the convolution of the two original densities, so $p_{x+\epsilon} = p_\epsilon * p_x$ and $p_{g+\epsilon} = p_\epsilon * p_g$ with $p_\epsilon(u) = \frac{1}{Z} e^{-\|u\|^2/(2\sigma^2)}$. Therefore the above equals

$$\mathbb{E}_{z \sim p_z}\Big\{-2\sigma^2 a(z)\,\nabla_\theta\!\int\! p_\epsilon(g_\theta(z) - y)\,p_x(y)\,dy + 2\sigma^2 b(z)\,\nabla_\theta\!\int\! p_\epsilon(g_\theta(z) - y)\,p_g(y)\,dy\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{-2\sigma^2 a(z)\!\int\!\nabla_\theta\,\frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\,p_x(y)\,dy + 2\sigma^2 b(z)\!\int\!\nabla_\theta\,\frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\,p_g(y)\,dy\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{a(z)\!\int\!\frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_x(y)\,dy - b(z)\!\int\!\frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_g(y)\,dy\Big\}$$

$$= \mathbb{E}_{z \sim p_z}\Big\{a(z)\!\int\! p_\epsilon(g_\theta(z) - y)\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_x(y)\,dy - b(z)\!\int\! p_\epsilon(g_\theta(z) - y)\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_g(y)\,dy\Big\} \qquad \text{QED}$$

Why does it work?

Interpretation

$$\mathbb{E}_{z \sim p_z}\big\{\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big\}
= \mathbb{E}_{z \sim p_z}\Big\{ a(z)\!\int\! p_\epsilon(g_\theta(z) - y)\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_x(y)\,dy
- b(z)\!\int\! p_\epsilon(g_\theta(z) - y)\,\nabla_\theta\|g_\theta(z) - y\|^2\,p_g(y)\,dy \Big\}$$

Assumption: $\sigma \gg 1$, so that $p_\epsilon \simeq k$ (roughly constant over the region of interest).

- First term (weighted by $p_x(y)$): a gradient step on the generator loss follows $-\nabla_\theta\|g_\theta(z) - y\|^2$ for data points $y \sim p_x$, i.e. it decreases the distance between the generated point $g_\theta(z)$ and the data. The overall effect is to move generated points close to the data manifold (ATTRACTION TO HIGH DENSITY).
- Without noise this weighting disappears: for $\sigma \to 0$, $p_\epsilon(g_\theta(z) - y) \simeq 0$, and we are back to VANISHING GRADIENTS.
- Second term (weighted by $p_g(y)$): the sign is flipped, so a gradient step follows $+\nabla_\theta\|g_\theta(z) - y\|^2$ for generated points $y \sim p_g$, i.e. it pushes generated points away from each other. The overall effect is to stretch the generated manifold (STRETCHING HIGH DENSITY REGIONS).

The balance between the two terms is set by $b(z) = a(z)\,\dfrac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}$:

- $a(z) \gg b(z)$ (generated samples lie far from the data): attraction dominates and $g_\theta(z)$ is pulled towards $p_x$.
- $a(z) \sim b(z)$ ($p_g$ already overlaps $p_x$): the two effects balance and the dominant behaviour is stretching of $p_g$ over the data manifold.

DEMO

[Toy demo setup: data and generated samples live on $[0, 1]$; latent noise $z \sim p_z$; generator with parameters $\theta_1, \theta_2$ producing $p_g$; discriminator $\sigma(\Sigma(\cdot))$ with parameters $\phi_1, \phi_2$ and output in $[0, 1]$, compared against the training data $p_x$]

Normal: GAMES = 1500, DISCRIMINATOR_STEPS = 1, GENERATOR_STEPS = 1

Strong discriminator: GAMES = 100, DISCRIMINATOR_STEPS = 600, GENERATOR_STEPS = 1

Strong generator: GAMES = 10, DISCRIMINATOR_STEPS = 1, GENERATOR_STEPS = 600
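The demo code itself is not included in the transcript; the following is a minimal sketch (assuming PyTorch) of the alternating loop that these three configurations control. `sample_real` stands in for whatever routine draws a minibatch of training data, and GAMES, DISCRIMINATOR_STEPS, GENERATOR_STEPS play the same role as in the settings above.

```python
# Alternating minimax training: DISCRIMINATOR_STEPS ascent steps on phi, then
# GENERATOR_STEPS descent steps on theta, repeated GAMES times.
import torch

def train_gan(g, d, sample_real, GAMES=1500, DISCRIMINATOR_STEPS=1, GENERATOR_STEPS=1,
              batch=128, z_dim=1, lr=1e-3):
    opt_g = torch.optim.Adam(g.parameters(), lr=lr)
    opt_d = torch.optim.Adam(d.parameters(), lr=lr)
    for _ in range(GAMES):
        for _ in range(DISCRIMINATOR_STEPS):        # maximize over phi
            x = sample_real(batch)
            z = torch.randn(batch, z_dim)
            p_real = d(x).clamp(1e-6, 1 - 1e-6)
            p_fake = d(g(z).detach()).clamp(1e-6, 1 - 1e-6)
            loss_d = -(torch.log(p_real) + torch.log(1 - p_fake)).mean()
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for _ in range(GENERATOR_STEPS):            # minimize over theta
            z = torch.randn(batch, z_dim)
            p_fake = d(g(z)).clamp(1e-6, 1 - 1e-6)
            loss_g = torch.log(1 - p_fake).mean()   # minimax loss from the problem formulation
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return g, d
```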

Open issues

1. Other solutions to the vanishing gradient problem?
2. Problem when the dimension of the support of the input distribution is lower than the dimension of the support of the data distribution: the generator $g_\theta(\cdot)$ maps $p_z$ to $p_g$ with $\dim(\mathrm{supp}(p_z)) \ge \dim(\mathrm{supp}(p_g))$, so if $\dim(\mathrm{supp}(p_x)) > \dim(\mathrm{supp}(p_g))$, can $p_g$ (before training) ever reach $p_x$ (after training)?

[Figure: $p_g$ before training vs. $p_g$ after training, compared with $p_x$]
