
ManiGAN: Text-Guided Image Manipulation

Bowen Li^1   Xiaojuan Qi^1,2   Thomas Lukasiewicz^1   Philip H. S. Torr^1
^1 University of Oxford   ^2 University of Hong Kong

{bowen.li, thomas.lukasiewicz}@cs.ox.ac.uk {xiaojuan.qi, philip.torr}@eng.ox.ac.uk

A. Architecture

We adopt ControlGAN [3] as the basic framework and replace batch normalisation with instance normalisation [6] everywhere in the generator network except in the first stage. In principle, the affine combination module (ACM) can be inserted anywhere in the generator, but we experimentally find that it is best to incorporate the module before the upsampling blocks and the image generation networks; see Fig. 2.
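As context for where the ACM fits, the following PyTorch-style sketch assumes the affine form described in the main paper, in which the image features predict a scale and a bias that modulate the hidden text-image features. The layer sizes and names here are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class ACM(nn.Module):
    """Illustrative sketch of a text-image affine combination module.

    Assumes the hidden features h are modulated by a scale W(v) and a bias
    b(v) predicted from the image features v, i.e. h' = h * W(v) + b(v).
    """
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.bias = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, h, v):
        # h: hidden text-image features; v: image features of the same spatial size
        return h * self.scale(v) + self.bias(v)
```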

A.1. Residual Block

Each residual block contains two convolutional layers, two instance normalisation (IN) [6] layers, and one GLU [1] non-linearity. The architecture of the residual block used in the detail correction module is shown in Fig. 1.

Figure 1. The architecture of the residual block: Conv3×3 → IN → GLU → Conv3×3 → IN.
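A minimal PyTorch-style sketch of the block in Fig. 1 could look as follows. The additive skip connection and the channel doubling before the GLU (which halves the channel dimension) are our assumptions, since the figure only shows the layer order.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of Fig. 1: Conv3x3 -> IN -> GLU -> Conv3x3 -> IN, plus a skip.

    The first convolution doubles the channels so that the GLU, which gates
    along the channel axis and halves it, restores the original width.
    """
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels * 2, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels * 2),
            nn.GLU(dim=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # assumed additive skip connection (not drawn in the figure)
        return x + self.block(x)
```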

B. Objective Functions

We train the main module and the detail correction module separately, and the generator and discriminator in both modules are trained alternately by minimising the generator loss $L_G$ and the discriminator loss $L_D$, respectively.
Generator objective. The loss function for the generator follows those used in ControlGAN [3], but we introduce a regularisation term:

$$
L_{\text{reg}} = 1 - \frac{1}{CHW}\,\lVert I' - I \rVert,
\tag{1}
$$

to prevent the network from collapsing to an identity mapping: $L_{\text{reg}}$ grows as the generated image becomes more similar to the input image, so minimising it imposes a large penalty when the generated image is (nearly) identical to the input.
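As an illustration, a minimal PyTorch-style sketch of this term is given below; it assumes the norm in Eq. (1) is an L1 norm over channels and pixels, averaged over the batch, and the function name is ours.

```python
def regularisation_loss(fake, real):
    """Sketch of Eq. (1): L_reg = 1 - ||I' - I|| / (C*H*W).

    fake, real: tensors of shape (B, C, H, W). Assumes an L1 norm over
    channels and pixels, averaged over the batch.
    """
    c, h, w = fake.shape[1:]
    diff = (fake - real).abs().flatten(start_dim=1).sum(dim=1) / (c * h * w)
    return (1.0 - diff).mean()
```

Minimising this quantity pushes $I'$ away from an exact copy of $I$, matching the discussion above.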

$$
L_G = \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log D(I')\big]}_{\text{unconditional adversarial loss}}
\;\underbrace{-\,\tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log D(I', S)\big]}_{\text{conditional adversarial loss}}
\;+\; L_{\text{ControlGAN}} + \lambda_1 L_{\text{reg}},
\tag{2}
$$

$$
L_{\text{ControlGAN}} = \lambda_2 L_{\text{DAMSM}} + \lambda_3\big(1 - L_{\text{corre}}(I', S)\big) + \lambda_4 L_{\text{rec}}(I', I),
\tag{3}
$$

where $I$ is the real image sampled from the true image distribution $P_{\text{data}}$, $S$ is the corresponding matched text that correctly describes $I$, and $I'$ is the generated image sampled from the model distribution $P_G$. The unconditional adversarial loss makes the synthetic image $I'$ indistinguishable from the real image $I$, the conditional adversarial loss aligns the generated image $I'$ with the given text description $S$, $L_{\text{DAMSM}}$ [8] measures the text-image similarity at the word level to provide fine-grained feedback for image generation, $L_{\text{corre}}$ [3] determines whether word-related visual attributes exist in the image, and $L_{\text{rec}}$ [3] reduces the randomness involved in the generation process. $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters controlling the importance of the additional losses. Note that we do not use $L_{\text{rec}}$ when we train the detail correction module.

Discriminator objective. The loss function for the discriminator follows those used in ControlGAN [3], and the function used to train the discriminator in the detail correction module is the same as the one used in the last stage of the main module.

$$
\begin{aligned}
L_D = {} & \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{\text{data}}}\big[\log D(I)\big] - \tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log\big(1 - D(I')\big)\big]}_{\text{unconditional adversarial loss}} \\
& \underbrace{-\,\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{\text{data}}}\big[\log D(I, S)\big] - \tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log\big(1 - D(I', S)\big)\big]}_{\text{conditional adversarial loss}} \\
& + \lambda_3\big((1 - L_{\text{corre}}(I, S)) + L_{\text{corre}}(I, S')\big),
\end{aligned}
\tag{4}
$$

where $S'$ is a text description randomly sampled from the dataset. The unconditional adversarial loss determines whether the given image is real, and the conditional adversarial loss reflects the semantic similarity between images and texts.
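To make the adversarial structure of Eqs. (2) and (4) concrete, here is a minimal PyTorch-style sketch. The discriminator interface (callable with and without a sentence embedding) and the precomputed ControlGAN, regularisation, and correlation terms are placeholders for components defined in ControlGAN [3] and AttnGAN [8]; this is not the released implementation.

```python
import torch

def generator_loss(D, fake, sent_emb, l_controlgan, l_reg, lambda_1=1.0):
    """Sketch of Eq. (2): unconditional and conditional adversarial terms on
    generated images, plus the ControlGAN losses (Eq. (3)) and L_reg."""
    uncond = -0.5 * torch.log(D(fake)).mean()           # D(I')
    cond = -0.5 * torch.log(D(fake, sent_emb)).mean()   # D(I', S)
    return uncond + cond + l_controlgan + lambda_1 * l_reg

def discriminator_loss(D, real, fake, sent_emb,
                       corre_matched, corre_mismatched, lambda_3=1.0):
    """Sketch of Eq. (4): real/fake terms for the unconditional and conditional
    branches, plus correlation terms with matched (S) and mismatched (S') text."""
    uncond = (-0.5 * torch.log(D(real)).mean()
              - 0.5 * torch.log(1.0 - D(fake)).mean())
    cond = (-0.5 * torch.log(D(real, sent_emb)).mean()
            - 0.5 * torch.log(1.0 - D(fake, sent_emb)).mean())
    return uncond + cond + lambda_3 * ((1.0 - corre_matched) + corre_mismatched)
```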


[Figure 2 omitted: architecture diagram. Panel (a) shows the main module: a text encoder producing word and sentence features, an image encoder producing regional and global features, noise z ~ N(0, 1), three generator stages G1–G3 with spatial and channel-wise attention and ACMs, discriminators D1–D3, and perceptual and regularisation losses. Panel (b) shows the detail correction module with G_DCM and D_DCM.]

Figure 2. The architecture of ManiGAN. ACM denotes the text-image affine combination module. Red dashed box indicates the architecture of the detail correction module.

[Figure 3 omitted: manipulation results for the text "This bird has a yellow bill, a blue head, blue wings and a yellow belly.", showing the original image and the results after 50, 100, 150, 200, 250, and 300 epochs.]

Figure 3. Trend of the manipulation results as the number of training epochs increases on the CUB dataset.

[Figure 4 omitted: manipulation results for the text "Zebra, dirt.", showing the original image and the results after 3, 6, 9, 12, 15, and 18 epochs.]

Figure 4. Trend of the manipulation results as the number of training epochs increases on the COCO dataset.

C. Trend of Manipulation Results

We track how the manipulation results evolve as the number of training epochs increases, as shown in Figs. 3 and 4. The original images are smoothly modified to achieve the best balance between the generation of new visual attributes (e.g., the blue head, blue wings, and yellow belly in Fig. 3, and the dirt background in Fig. 4) and the reconstruction of text-irrelevant contents of the original images (e.g., the shape of the bird and the background in Fig. 3, and the appearance of the zebras in Fig. 4). However, as training continues further, the generated visual attributes aligned with the given text descriptions (e.g., the blue head, blue wings, and yellow belly of the bird, and the dirt background behind the zebras) are gradually erased, and the synthetic images become more and more similar to the original images. This verifies the existence of a trade-off between the generation of the new visual attributes required by the given text descriptions and the reconstruction of the text-irrelevant contents of the original images.

D. Additional Comparison Results

In Figs. 5, 6, 7, and 8, we show additional comparison results between our ManiGAN, SISGAN [2], and TAGAN [5] on the CUB [7] and COCO [4] datasets. Please watch the accompanying video for a detailed comparison.


[Figure 5 omitted: qualitative comparison grid (given text, original image, SISGAN [2], TAGAN [5], ours) for six bird descriptions.]

Figure 5. Additional comparison results between ManiGAN, SISGAN, and TAGAN on the CUB bird dataset.


[Figure 6 omitted: qualitative comparison grid (given text, original image, SISGAN [2], TAGAN [5], ours) for six further bird descriptions.]

Figure 6. Additional comparison results between ManiGAN, SISGAN, and TAGAN on the CUB bird dataset.


[Figure 7 omitted: qualitative comparison grid (given text, original image, SISGAN [2], TAGAN [5], ours) for the texts "Sunset.", "Blue boat, green grass.", "White bus.", "Man, dry grass.", "Brown cow, dirt.", and "Boy, road.".]

Figure 7. Additional comparison results between ManiGAN, SISGAN, and TAGAN on the COCO dataset.


[Figure 8 omitted: qualitative comparison grid (given text, original image, SISGAN [2], TAGAN [5], ours) for the texts "Zebra, grass.", "Orange bus.", "Night.", "Kite, green field.", "Pizza, pepperoni.", and "Zebra, water.".]

Figure 8. Additional comparison results between ManiGAN, SISGAN, and TAGAN on the COCO dataset.


References

[1] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pages 933–941, 2017.

[2] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5706–5714, 2017.

[3] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip H. S. Torr. Controllable text-to-image generation. arXiv preprint arXiv:1909.07083, 2019.

[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755. Springer, 2014.

[5] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In Advances in Neural Information Processing Systems, pages 42–51, 2018.

[6] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[7] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

[8] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.