ManiGAN: Text-Guided Image Manipulation
Bowen Li¹, Xiaojuan Qi¹,², Thomas Lukasiewicz¹, Philip H. S. Torr¹
¹University of Oxford  ²University of Hong Kong
{bowen.li, thomas.lukasiewicz}@cs.ox.ac.uk  {xiaojuan.qi, philip.torr}@eng.ox.ac.uk
A. Architecture
We adopt ControlGAN [3] as the basic framework and replace batch normalisation with instance normalisation [6] everywhere in the generator network except in the first stage. In principle, the affine combination module (ACM) can be inserted anywhere in the generator, but we experimentally find that it is best to incorporate the module before the upsampling blocks and the image generation networks; see Fig. 2.
A.1. Residual Block
Each residual block contains two convolutional layers, two instance normalisation (IN) [6] layers, and one GLU [1] non-linearity. The architecture of the residual block used in the detail correction module is shown in Fig. 1.
Figure 1. The architecture of the residual block: Conv3×3 → IN → GLU → Conv3×3 → IN.
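The block in Fig. 1 can be sketched in PyTorch as follows. This is a minimal sketch, assuming the GLU acts on the channel dimension (so the first convolution doubles the channels, which GLU then halves) and that the block adds a skip connection around the whole stack; neither detail is spelled out in the figure.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Sketch of the residual block in Fig. 1:
    Conv3x3 -> IN -> GLU -> Conv3x3 -> IN, plus a skip connection (assumed)."""

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            # GLU halves the channel dimension, so the first conv doubles it.
            nn.Conv2d(channels, channels * 2, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels * 2),
            nn.GLU(dim=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the input resolution and channel count.
        return x + self.block(x)

feats = torch.randn(1, 64, 16, 16)  # illustrative feature map
out = ResBlock(64)(feats)           # same shape as the input
```

Because the block is shape-preserving, it can be stacked freely inside the detail correction module.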
B. Objective Functions
We train the main module and the detail correction module separately, and the generator and discriminator in both modules are trained alternately by minimising the generator loss $\mathcal{L}_G$ and the discriminator loss $\mathcal{L}_D$.
Generator objective. The loss function for the generator follows those used in ControlGAN [3], but we introduce a regularisation term:
$$
\mathcal{L}_{\text{reg}} = 1 - \frac{1}{CHW}\,\lVert I' - I \rVert,
\tag{1}
$$
to prevent the network from achieving an identity mapping: the term grows as the generated image $I'$ approaches the input image $I$, and thus penalises the degenerate solution in which the generated image becomes the same as the input image.
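Eq. (1) can be computed directly from the image tensors. A minimal sketch follows; the use of the L1 norm and per-image normalisation are assumptions, since the paper writes a generic norm $\lVert\cdot\rVert$.

```python
import torch

def regularisation_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """Eq. (1): L_reg = 1 - (1 / CHW) * ||I' - I||.
    The L1 norm is an assumption; the paper writes a generic norm."""
    c, h, w = fake.shape[-3:]
    return 1.0 - (fake - real).abs().sum() / (c * h * w)

real = torch.rand(3, 64, 64)
# The identity mapping I' == I yields the maximum value 1, the penalised case;
# any deviation from the input lowers the loss.
identity_penalty = regularisation_loss(real, real)
```

The further $I'$ drifts from $I$, the smaller this term becomes, which is why it counteracts trivial reconstruction.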
$$
\mathcal{L}_G = \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log D(I')\big]}_{\text{unconditional adversarial loss}}\;\underbrace{-\;\tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log D(I', S)\big]}_{\text{conditional adversarial loss}}\;+\;\mathcal{L}_{\text{ControlGAN}} + \lambda_1 \mathcal{L}_{\text{reg}},
\tag{2}
$$
$$
\mathcal{L}_{\text{ControlGAN}} = \lambda_2 \mathcal{L}_{\text{DAMSM}} + \lambda_3 \big(1 - \mathcal{L}_{\text{corre}}(I', S)\big) + \lambda_4 \mathcal{L}_{\text{rec}}(I', I),
\tag{3}
$$
where $I$ is a real image sampled from the true image distribution $P_{\text{data}}$, $S$ is the matched text that correctly describes $I$, and $I'$ is a generated image sampled from the model distribution $P_G$. The unconditional adversarial loss makes the synthetic image $I'$ indistinguishable from the real image $I$, the conditional adversarial loss aligns the generated image $I'$ with the given text description $S$, $\mathcal{L}_{\text{DAMSM}}$ [8] measures the text-image similarity at the word level to provide fine-grained feedback for image generation, $\mathcal{L}_{\text{corre}}$ [3] determines whether word-related visual attributes exist in the image, and $\mathcal{L}_{\text{rec}}$ [3] reduces the randomness involved in the generation process. $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters controlling the importance of the additional losses. Note that we do not use $\mathcal{L}_{\text{rec}}$ when we train the detail correction module.
Discriminator objective. The loss function for the discriminator follows those used in ControlGAN [3], and the function used to train the discriminator in the detail correction module is the same as the one used in the last stage of the main module:
$$
\begin{aligned}
\mathcal{L}_D ={} & \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{\text{data}}}\big[\log D(I)\big] - \tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log\big(1 - D(I')\big)\big]}_{\text{unconditional adversarial loss}} \\
& \underbrace{-\,\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{\text{data}}}\big[\log D(I, S)\big] - \tfrac{1}{2}\,\mathbb{E}_{I' \sim P_G}\big[\log\big(1 - D(I', S)\big)\big]}_{\text{conditional adversarial loss}} \\
& + \lambda_3\big(\big(1 - \mathcal{L}_{\text{corre}}(I, S)\big) + \mathcal{L}_{\text{corre}}(I, S')\big),
\end{aligned}
\tag{4}
$$
where $S'$ is a text description randomly sampled from the dataset, and thus mismatched with the image $I$. The unconditional adversarial loss determines whether the given image is real, and the conditional adversarial loss reflects the semantic similarity between images and texts.
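Given precomputed discriminator outputs and correlation scores, Eq. (4) reduces to a few lines. The sketch below assumes the discriminator emits probabilities in $(0, 1)$; all argument names are illustrative, not taken from the authors' code.

```python
import torch

def discriminator_loss(d_real_u: torch.Tensor, d_fake_u: torch.Tensor,
                       d_real_c: torch.Tensor, d_fake_c: torch.Tensor,
                       corre_match: torch.Tensor, corre_mismatch: torch.Tensor,
                       lam3: float = 1.0) -> torch.Tensor:
    """Sketch of Eq. (4). The d_* arguments are discriminator outputs in (0, 1):
    *_u unconditional, *_c conditioned on the matched text S. corre_mismatch is
    L_corre(I, S') for a randomly sampled mismatched text S'."""
    # Unconditional adversarial loss: real vs. generated images.
    uncond = -0.5 * torch.log(d_real_u).mean() - 0.5 * torch.log(1 - d_fake_u).mean()
    # Conditional adversarial loss: image-text pairs.
    cond = -0.5 * torch.log(d_real_c).mean() - 0.5 * torch.log(1 - d_fake_c).mean()
    # Correlation terms: reward matched text S, penalise mismatched text S'.
    corre = lam3 * ((1 - corre_match) + corre_mismatch)
    return uncond + cond + corre

half = torch.tensor([0.5])
loss = discriminator_loss(half, half, half, half,
                          torch.tensor(1.0), torch.tensor(0.0))
# With D outputting 0.5 everywhere and perfect correlation scores, only the
# adversarial terms contribute: loss = 2 * ln 2.
```

The $(1 - \mathcal{L}_{\text{corre}}(I, S))$ term drives matched-pair correlation towards 1, while the $\mathcal{L}_{\text{corre}}(I, S')$ term drives mismatched-pair correlation towards 0.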
Figure 2. The architecture of ManiGAN: (a) the main module, with the text encoder, image encoder, generators G1–G3, discriminators D1–D3, spatial and channel-wise attention, noise $z \sim \mathcal{N}(0, 1)$, and the perceptual and regularisation losses; (b) the detail correction module ($G_{\text{DCM}}$, $D_{\text{DCM}}$). ACM denotes the text-image affine combination module. The red dashed box indicates the architecture of the detail correction module.
Figure 3. Trend of the manipulation results as the number of training epochs increases on the CUB dataset (given text: "This bird has a yellow bill, a blue head, blue wings and a yellow belly."; columns: original image and results after 50, 100, 150, 200, 250, and 300 epochs).
Figure 4. Trend of the manipulation results as the number of training epochs increases on the COCO dataset (given text: "Zebra, dirt."; columns: original image and results after 3, 6, 9, 12, 15, and 18 epochs).
C. Trend of Manipulation Results
We track how the manipulation results evolve as the number of training epochs increases, as shown in Figs. 3 and 4. The original images are smoothly modified to achieve the best balance between the generation of new visual attributes (e.g., the blue head, blue wings, and yellow belly in Fig. 3, and the dirt background in Fig. 4) and the reconstruction of the text-irrelevant contents of the original images (e.g., the shape of the bird and the background in Fig. 3, and the appearance of the zebras in Fig. 4). However, as training continues, the generated visual attributes aligned with the given text descriptions (e.g., the blue head, blue wings, and yellow belly of the bird, and the dirt background behind the zebras) are gradually erased, and the synthetic images become more and more similar to the original images. This verifies the existence of a trade-off between generating the new visual attributes required by the given text descriptions and reconstructing the text-irrelevant contents of the original images.
D. Additional Comparison Results
In Figs. 5, 6, 7, and 8, we show additional comparison results between our ManiGAN, SISGAN [2], and TAGAN [5] on the CUB [7] and COCO [4] datasets. Please watch the accompanying video for a detailed comparison.
Figure 5. Additional comparison results between ManiGAN, SISGAN [2], and TAGAN [5] on the CUB bird dataset (columns: given text, original image, SISGAN, TAGAN, ours).
Figure 6. Additional comparison results between ManiGAN, SISGAN [2], and TAGAN [5] on the CUB bird dataset (columns: given text, original image, SISGAN, TAGAN, ours).
Figure 7. Additional comparison results between ManiGAN, SISGAN [2], and TAGAN [5] on the COCO dataset (columns: given text, original image, SISGAN, TAGAN, ours).
Figure 8. Additional comparison results between ManiGAN, SISGAN [2], and TAGAN [5] on the COCO dataset (columns: given text, original image, SISGAN, TAGAN, ours).
References
[1] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pages 933–941, 2017.
[2] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5706–5714, 2017.
[3] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip H. S. Torr. Controllable text-to-image generation. arXiv preprint arXiv:1909.07083, 2019.
[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755. Springer, 2014.
[5] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In Advances in Neural Information Processing Systems, pages 42–51, 2018.
[6] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[7] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[8] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.