arXiv:2005.04909v1 [cs.CV] 11 May 2020

Conditional Image Generation and Manipulation for User-Specified Content

David Stap∗ Maurits Bleeker Sarah Ibrahimi Maartje ter Hoeve
University of Amsterdam

{d.stap,m.j.r.bleeker,s.ibrahimi,m.a.terhoeve}@uva.nl

Abstract

In recent years, Generative Adversarial Networks (GANs) have improved steadily towards generating increasingly impressive real-world images. It is useful to steer the image generation process for purposes such as content creation. This can be done by conditioning the model on additional information. However, when conditioning on additional information, there still exists a large set of images that agree with a particular conditioning. This makes it unlikely that the generated image is exactly as envisioned by a user, which is problematic for practical content creation scenarios such as generating facial composites or stock photos. To solve this problem, we propose a single pipeline for text-to-image generation and manipulation. In the first part of our pipeline we introduce textStyleGAN, a model that is conditioned on text. In the second part of our pipeline we make use of the pre-trained weights of textStyleGAN to perform semantic facial image manipulation. The approach works by finding semantic directions in latent space. We show that this method can be used to manipulate facial images for a wide range of attributes. Finally, we introduce the CelebTD-HQ dataset, an extension to CelebA-HQ, consisting of faces and corresponding textual descriptions.

1. Introduction

Conditional image generation has experienced rapid progress over the last few years [18, 19, 25, 23]. In this task, an image is generated by some sort of generative model which is conditioned on a number of attributes or on a textual description. For content creation scenarios such as generating CGI for animation movies, making forensic sketches of suspects for the police, or generating stock photos, it would be beneficial to have the chance to modify the image after initial generation to close the gap between the user requirements and the output of the model. This is important because a large number of images can adhere to a given description. The final image can then be generated by

∗This paper is the product of an internship at the Dutch National Police

[Figure 1 panels: Image in user’s mind → Conditionally generated image → Manipulated image closer to user’s wish]

Figure 1. Generating a user-specified image. A first approximation is generated using a textual description. The resulting image is then further manipulated such that it is closer to the user’s desire.

a back-and-forth consultation with the individual who has the “ground truth” of the description in mind.

In this work we focus on the generation of these user-specified images. The overarching question we aim to answer is: How can we generate and manipulate an image, based on a fine-grained description, such that it represents the image a person had in mind? In particular, we focus on the facial domain and aim to generate and manipulate facial images based on fine-grained descriptions. Since there exists no dataset of facial images with natural language descriptions, we create the CelebFaces Textual Description High Quality (CelebTD-HQ) dataset. The dataset consists of facial images and fine-grained textual descriptions.

Figure 1 shows the pipeline of our method: first we generate an initial version of an image, based on a description. Then we modify this generated image to better match the image a person had in mind. To the best of our knowledge we are the first to combine the generation and manipulation in a single pipeline. The manipulation step is essential for practical content creation applications that aim


to generate the image a person had in mind. With this work, we contribute

(C1) A text-to-image model, textStyleGAN, which can generate images at higher resolution than other text-to-image models and beats the current state of the art in the majority of cases;

(C2) A method for semantic manipulation that uses textStyleGAN weights. We show that these conditional weights can be used for semantic facial manipulation;

(C3) An extension to the CelebA-HQ [7] dataset, where we use the attributes to generate natural-sounding textual descriptions. We call this new dataset CelebFaces Textual Description High Quality (CelebTD-HQ). We share the dataset to facilitate future research.1

2. Related work

Our work is related to work on text-to-image synthesis (first part of our pipeline) and semantic image manipulation (second part of our pipeline). In this section we describe relevant work in both areas. We also discuss StyleGAN [8], which plays an important role in our architecture.

Text-to-image synthesis The goal of text-to-image synthesis is to generate an image, given a textual description. The image should be visually realistic and semantically consistent with the description. Most text-to-image synthesis methods are based on Generative Adversarial Networks (GANs) [5]. Pioneering work by Reed et al. [19, 18] shows that plausible low resolution images can be generated from a textual representation. Zhang et al. [25, 26] propose StackGAN, which decomposes the problem into several steps; a Stage-I GAN generates a low resolution image conditioned on text, and higher-stage GANs generate higher resolution images conditioned on the results of earlier stages and the text. More recently, Xu et al. [23] proposed AttnGAN, which exploits an attention mechanism to focus on different words when generating different image regions. However, guaranteeing semantic consistency between the text description and the generated image remains challenging. As a solution, Qiao et al. [17] introduce a semantic text regeneration and alignment (STREAM) module, which regenerates the text description for the generated image. Yin et al. [24] propose a Siamese mechanism in the discriminator, which distills the semantic commons while retaining the semantic diversities between different text descriptions for an image. Note that none of the models discussed in this section has a manipulation mechanism, which renders them impractical for content creation.

Semantic image manipulation A relatively complex type of image modification is semantic manipulation, which can be thought of as changing attributes such as pose, expression or gender. The manipulation should preserve realism, and unedited factors should be left unchanged. Perarnau et al. [15] propose an extension to conditional GANs [13], where an image is encoded to obtain a vector image representation and a conditional binary representation. This binary representation refers to facial characteristics such as gender, hair colour and hair type. Editing is done by changing the binary representation. Shen et al. [21] propose a framework consisting of two generators which perform inverse attribute manipulations given an input image, e.g. adding glasses and removing glasses. Chen et al. [3] propose an encoder-decoder architecture that models face effects with middle-level convolutional layers. In later work, Chen et al. [4] introduce a semantic component model that does not rely on a generative model. An attribute to be edited is decomposed into multiple semantic components, each corresponding to a specific face region, which are then manipulated independently. A method based on a Variational Autoencoder [9] is introduced by Qian et al. [16]. Better disentanglement, where the goal is to separate out (disentangle) features into disjoint parts of a representation, is obtained by clustering in latent space. An image and target facial structural information are both encoded to latent space, and an output image is generated based on the image appearance and the target facial structural information.

1 https://github.com/davidstap/CelebTD-HQ

StyleGAN Karras et al. [8] introduce StyleGAN, an unconditional generative model for image synthesis. StyleGAN uses a mapping network: instead of feeding the generator a noise vector directly, first a non-linear mapping is applied. Karras et al. show empirically that this mapping better disentangles the latent factors of variation. This implies that attributes, such as gender, are easily separable in the resulting latent space, allowing for easier semantic manipulation. Furthermore, StyleGAN supports resolutions up to 1024 × 1024.

Our current work differs from the work discussed in this section in the following important ways: First, our model combines text-to-image generation and semantic manipulation, which is important for content creation. Second, we introduce a conditional variant of StyleGAN instead of building on AttnGAN [17, 24]. This enables us to generate and manipulate at higher resolutions, allowing for more fine-grained details. Third, in contrast to earlier work on semantic manipulation, e.g. [21, 3, 4, 16], we make use of the latent space of a trained GAN to find semantic directions. Our method is simple and computationally cheap, since it only requires a classification step, without a need to retrain the GAN or extra computationally expensive modules such as an additional Generator.


[Figure 2 diagram: Text Encoder (esent, ewords) with Conditioning Augmentation (CA), z ∼ N(0, I), FC mapping network (×8) producing w, generator blocks (Conv 3×3, AdaIN, Upsample, noise, Attention) up to a 1024 × 1024 block, Image Encoder, and Discriminator with losses LG, LD, LCMPM and LCMPC; manipulation shifts wim to wim + δθ.]

Figure 2. Overview of our approach. An image is generated from a textual description (Section 3.1). The images are generated progressively, starting from a low resolution. We only depict a single Generator block for clarity. The attention model retrieves the most relevant word vectors for generating different sub-regions. The LCMPM and LCMPC losses measure semantic consistency between the text and the generated image. The resulting image can be manipulated by making use of semantic directions in latent space (Section 3.2).

3. Method

In this section we describe our approach, in which we combine text-to-image synthesis (Part 1) and semantic manipulation (Part 2), in full detail. An overview is given in Figure 2. We conclude this section with a description of our new CelebTD-HQ dataset.

3.1. Part 1 - textStyleGAN

The first part of our pipeline consists of a text-to-image step. This step is visually depicted in the left part of Figure 2. We condition StyleGAN [8] on a textual description.

Generator For the generator, we combine a latent variable z ∈ R^{Dz}, z ∼ N(0, I), and the text embedding esent ∈ R^{512} by concatenation, and perform a linear mapping, resulting in z = W[z; esent] with W ∈ R^{Dz×(Dz+512)}.
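As a minimal sketch of this conditioning step (NumPy only; the projection matrix W stands in for a learned parameter and is randomly initialized purely for illustration, and Dz = 512 is an assumed dimensionality):

```python
import numpy as np

Dz = 512                          # latent dimensionality (assumed)
De = 512                          # sentence embedding dimensionality, esent in R^512

rng = np.random.default_rng(0)
z = rng.standard_normal(Dz)       # z ~ N(0, I)
e_sent = rng.standard_normal(De)  # placeholder sentence embedding from the text encoder

# Learnable projection W in R^{Dz x (Dz + De)}; random stand-in for illustration.
W = rng.standard_normal((Dz, Dz + De)) * 0.01

z_cond = W @ np.concatenate([z, e_sent])  # conditioned latent fed to the mapping network
print(z_cond.shape)                       # (512,)
```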

Representing textual descriptions In order to obtain the text representation esent, we make use of recent image-text matching work [27].2 First we pre-train a bi-directional LSTM (for text) and a CNN (for images) encoder to learn a joint embedding space for text and images using two loss functions: 1) the cross-modal projection matching (CMPM) loss, which minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions, and 2) the cross-modal projection classification (CMPC) loss, which makes use of an auxiliary classification task. We then add these pre-trained encoders to textStyleGAN, use the text encoder to obtain the sentence feature esent and word features ewords, and use the image encoder to encode generated images.3

2 Recent text-to-image works [23, 17, 24] use a similar but inferior type of visual-semantic embedding that was introduced by Reed et al. [18]. For a fair comparison we have experimented with these visual-semantic embeddings, but did not find any significant difference. We use the embeddings by [27], as we will make use of the corresponding image encoder for cross-modal similarity enhancement.

Conditioning Augmentation Instead of directly using the textual representation esent for image generation, we use Conditioning Augmentation (CA) [25]. This encourages smoothness in the latent conditioning manifold. We sample from N(µ(esent), Σ(esent)), where the mean µ(esent) and covariance matrix Σ(esent) are functions of the embedding esent. These functions are learned using the reparameterization trick [9]. To enforce smoothness and prevent overfitting we add the following KL divergence regularization term during training:

DKL( N(µ(esent), Σ(esent)) ‖ N(0, I) ).   (1)
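A minimal sketch of this CA step (NumPy; a diagonal covariance is assumed, and the fully connected layer that predicts µ and log σ² from esent is replaced by a random matrix purely for illustration):

```python
import numpy as np

def conditioning_augmentation(e_sent, out_dim=128, rng=np.random.default_rng(0)):
    # A single FC layer predicts mean and log-variance from the sentence embedding.
    # The weights below are random stand-ins for learned parameters.
    W = rng.standard_normal((2 * out_dim, e_sent.shape[0])) * 0.01
    stats = W @ e_sent
    mu, log_var = stats[:out_dim], stats[out_dim:]

    # Reparameterization trick: c = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(out_dim)
    c = mu + np.exp(0.5 * log_var) * eps

    # KL divergence between N(mu, diag(sigma^2)) and N(0, I), cf. Eq. (1).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return c, kl

c, kl = conditioning_augmentation(np.random.randn(512))
```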

We then feed z through eight fully connected layers to create the more disentangled representation w, according to the original StyleGAN generator architecture. From this w an image is then generated.

Attentional guidance Instead of only using the final sentence representation esent ∈ R^D, the Generator also makes use of the intermediate representations ewords ∈ R^{D×T} for attentional guidance [23]. Specifically, the attentional guidance module takes as input word features ewords and image features h. The word features are first converted to the same dimensionality by multiplying with a (learnable) matrix, i.e. e′words = U ewords. Then, a word-context vector ci is computed for each subregion of the image based on its hidden features h and word features ewords. Each column of h is a feature vector of a sub-region of the image. For the ith

3 We experimented with fine-tuning the text and image encoder but did not observe improved performance.


subregion, ci is a dynamic representation of the word vectors relevant to hi, which is calculated as

ci = Σ_{j=1}^{T} αij e′j.   (2)

The weight αij — which indicates how much the model attends to the jth word when generating the ith subregion of the image — is computed by

αij = exp(score(hi, ej)) / Σ_{k=1}^{T} exp(score(hi, ek)),   (3)

where

score(hi, ej) = hiᵀ e′j   (4)

is the dot score function [11], which can be thought of as a similarity score between the image sub-region and the word. The Generator receives attentional guidance at resolutions of 64 × 64 and upwards.
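To make Eqs. (2)–(4) concrete, a small NumPy sketch of the attention computation is given below; U is the learnable projection that maps word features to the image feature dimensionality and is randomly initialized here only for illustration, with toy dimensions:

```python
import numpy as np

def word_context_vectors(h, e_words, U):
    """h: (D, N) image sub-region features, e_words: (Dw, T) word features,
    U: (D, Dw) learnable projection. Returns context vectors c of shape (D, N)."""
    e_prime = U @ e_words                        # (D, T): e'_words = U e_words
    scores = h.T @ e_prime                       # (N, T), Eq. (4): score(h_i, e_j) = h_i^T e'_j
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # (N, T), Eq. (3)
    return e_prime @ alpha.T                     # (D, N), Eq. (2): c_i = sum_j alpha_ij e'_j

rng = np.random.default_rng(0)
D, Dw, N, T = 32, 256, 64, 12                    # toy sizes
h = rng.standard_normal((D, N))
e_words = rng.standard_normal((Dw, T))
U = rng.standard_normal((D, Dw)) * 0.01
c = word_context_vectors(h, e_words, U)          # one context vector per sub-region
```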

Discriminator As for the discriminator, we feed it an image (either real or fake) and the corresponding text embedding esent. The image is fed through nine convolutional layers (one for each intermediate resolution in the case of 1024 × 1024 images, according to the original StyleGAN discriminator architecture) and a fully connected layer, resulting in a feature representation h ∈ R^{Dh}. We then append esent and feed the resulting vector through a final fully connected layer, where the Discriminator outputs a value that represents the probability of the image being real, resulting in the loss signals LG and LD.

Training loss Following common practice in conditional synthesis [25, 26, 23] we employ two adversarial generator losses: an unconditional loss, measuring visual realism, and a conditional loss that measures semantic consistency.

LG = −½ Ex∼pG [log D(x)] − ½ Ex∼pG [log D(x, esent)],   (5)

where the first term is the unconditional loss and the second term the conditional loss. The discriminator is trained to classify the input image x as real or fake by minimizing the loss

LD = −½ Ex∼pdata [log D(x)] − ½ Ex∼pG [log(1 − D(x))]
     −½ Ex∼pdata [log D(x, esent)] − ½ Ex∼pG [log(1 − D(x, esent))],   (6)

where the first line is the unconditional loss and the second line the conditional loss.

To additionally ensure semantic consistency between textual descriptions and generated images, we make use of the cross-modal similarity losses, i.e. the CMPM and CMPC losses [27], LCMPM and LCMPC. We use the word representations ewords, and encode the generated image to calculate these losses.
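A schematic sketch of the adversarial objectives in Eqs. (5) and (6) follows (NumPy; the inputs stand in for discriminator output probabilities with and without text conditioning, and the CMPM/CMPC terms are omitted):

```python
import numpy as np

def generator_loss(d_fake_uncond, d_fake_cond, eps=1e-8):
    """Eq. (5): probabilities assigned by the discriminator to generated images."""
    uncond = -0.5 * np.mean(np.log(d_fake_uncond + eps))
    cond = -0.5 * np.mean(np.log(d_fake_cond + eps))
    return uncond + cond

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond, eps=1e-8):
    """Eq. (6): real/fake probabilities, with and without the text embedding."""
    uncond = (-0.5 * np.mean(np.log(d_real_uncond + eps))
              - 0.5 * np.mean(np.log(1.0 - d_fake_uncond + eps)))
    cond = (-0.5 * np.mean(np.log(d_real_cond + eps))
            - 0.5 * np.mean(np.log(1.0 - d_fake_cond + eps)))
    return uncond + cond

# Toy probabilities for a batch of four images.
rng = np.random.default_rng(0)
p = lambda: rng.uniform(0.1, 0.9, size=4)
lg = generator_loss(p(), p())
ld = discriminator_loss(p(), p(), p(), p())
```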

3.2. Part 2 - Semantic manipulation by latent sample classification

The conditionally generated images from Part 1 are unlikely to correspond exactly to what a user had in mind when giving a certain description. To account for this, we extend our model such that a generated image can be manipulated, by making use of semantic directions in latent space. Although this method is data agnostic, we highlight the facial domain; applying the method to other domains is analogous.

In the remainder of this section we describe our method for finding a wide range of directions, which we call latent sample classification. We make use of the disentangled textStyleGAN latent space. The first key insight of the second part of our method is that we make use of the latent textStyleGAN space W instead of Z. Using the latter would require us to solve the more challenging task of recovering the conditioning of an image when finding its latent representation.

Our method can be described in four steps:

1. Sample n images from a pre-trained (text)StyleGAN generator.

2. Perform classification for the images on a single attribute (e.g., smiling / not smiling, male / female) to obtain labels.

3. Train a logistic regression classifier on a single attribute to find a direction.

4. The resulting weight vector θ is the desired direction in latent space.

We describe all four steps in more detail below.

• Step 1 We randomly sample 1000 latent vectors z ∼ N(0, I) and transform these into w, which we treat as features in subsequent steps. We established empirically, by varying n, that more samples do not improve the metric scores.


• Step 2 For classification of our generated images we make use of Face API4, which we treat as a black-box classification algorithm. This service allows us to obtain classifications for smile, gender and age. We choose to use this service as opposed to training our own classification algorithm because we do not want to rely on labels; our aim is to devise a method that works in both the conditioned and the unconditioned case.

• Step 3 Why do we train a linear classifier? A major benefit of the intermediate latent space W is that it does not have to support sampling according to any fixed distribution. Instead, its sampling density is a result of the learned piecewise continuous mapping f(z). Karras et al. [8] show empirically that this mapping unwarps W so that the factors of variation become more linear. As a result, manipulating images in this space is expected to be more flexible to a training set with arbitrary attribute distribution, i.e. manipulating a single feature is less likely to influence other highly correlated features.

Our aim is to find the correct binary labels y (e.g., smiling / not smiling) given features w (sampled latent image codes), and therefore we optimize the cross-entropy loss. This allows us to find an approximation of the optimal weights θ.

• Step 4 Intuitively, wim is a point in latent space W corresponding to an image after the transformation by the generator. The resulting weights θ from the previous step represent the desired (approximately linear) direction in latent space W. By combining wim and θ, we can take a step in the desired manipulation direction, resulting in w ∈ W which (after being fed to the generator) corresponds to a manipulation of the original image wim. That is, to edit an image wim, we simply add (or subtract) θ:

w = wim ± θ. (7)

Optionally, larger or smaller increments can be used, for example to add more or less smile; i.e. w = wim + λθ is a valid operation if one wants more control over the outcome, which we demonstrate in Section 5.
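The four steps can be summarized in a short sketch (NumPy and scikit-learn; mapping_net and classify_attribute are placeholders for the pre-trained mapping network and a black-box function that returns a binary label for the image generated from a given w, e.g. via the external Face API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_direction(mapping_net, classify_attribute, n=1000, dz=512):
    """Steps 1-4: sample latent codes, label them for one binary attribute,
    and fit a logistic regression; its weight vector is the semantic direction."""
    z = np.random.randn(n, dz)                          # Step 1: z ~ N(0, I)
    w = np.stack([mapping_net(zi) for zi in z])         # map to the disentangled W space
    y = np.array([classify_attribute(wi) for wi in w])  # Step 2: black-box labels (0/1)
    clf = LogisticRegression(max_iter=1000).fit(w, y)   # Step 3: cross-entropy fit
    theta = clf.coef_[0]                                # Step 4: direction in W
    return theta / np.linalg.norm(theta)

def manipulate(w_im, theta, lam=1.0):
    """Eq. (7) with a step size: w = w_im + lambda * theta."""
    return w_im + lam * theta
```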

3.3. CelebTD-HQ

We introduce the CelebTD-HQ dataset as a first step towards creating photo-realistic images of faces given a description by a user. We build on the CelebA-HQ [7] dataset, which consists of facial images and their attributes. We use a probabilistic context-free grammar (PCFG) to generate a wide variety of textual descriptions, based on the given attributes. See Appendix A for details. Figure 3 gives some examples of sentences that we generate. Following the format

4 https://azure.microsoft.com/services/cognitive-services/face/

of the popular CUB dataset [22], we create ten unique single-sentence descriptions per image to obtain more training data. We call our dataset CelebFaces Textual Description High Quality (CelebTD-HQ). The total dataset consists of 30000 images, which we randomly divide into 80% train and 20% test samples. In Section 4 we describe all other datasets that we use for our experiments.

4. Experimental setup

4.1. Datasets

To evaluate our pipeline we use the CUB [22] and MS COCO [10] datasets for the text-to-image part, and the FFHQ [8] dataset. We use our new CelebTD-HQ dataset to show that our method for semantic manipulation works. Since our method is domain-agnostic, one could also use it for semantic manipulation with other images. We describe the first three datasets in more detail here; our CelebTD-HQ was described in Section 3.3.

• CUB [22] contains 11788 images of 200 bird species. Reed et al. [18] collected ten single-sentence visual descriptions per image. The data is divided into train and test sets according to [23], resulting in 8855 train and 2933 test samples.

• MS COCO [10] is comprised of complex everyday scenes containing 91 object types in their natural context. It has five textual descriptions per image. The data is divided into train and test sets according to [23], resulting in 80000 train and 40000 test samples.

• FFHQ [8] consists of 70000 high-quality images at 1024 × 1024 resolution. In contrast to the other datasets, it is unlabeled. This dataset is used to train StyleGAN [8]. We fine-tune pre-trained StyleGAN weights for conditional facial synthesis.

4.2. Ablation study

We present several ablations for the text-to-image task, resulting in the following architectures:

• textStyleGAN base The textStyleGAN base model corresponds to our architecture as presented in Section 3.1, but without Conditioning Augmentation (CA), attentive guidance and the cross-modal similarity loss;

• with Conditioning Augmentation This model corresponds to the textStyleGAN base model with CA;

• with Attention This model corresponds to the textStyleGAN base model with CA and attentive guidance;

• with Cross-modal similarity loss This is our complete model, i.e. the textStyleGAN base model with CA, attentive guidance and the cross-modal similarity loss.


4.3. Implementation details

Part 1 - Conditional Synthesis. TextStyleGAN builds on the original StyleGAN [8] TensorFlow [1] implementation5. We adopt all training procedures and hyperparameters from Karras et al. [8]. Training of our conditional image generation models is performed on 4 NVIDIA GeForce 1080Ti GPUs with 11GB of RAM. Although the vanilla textStyleGAN model described above fits on a single GPU, we need to decrease the model size before adding enhancements as described in the previous section. We use mixed precision training [12] and observe no performance drop.

Part 2 - Image manipulation models. For our semantic image manipulation method, we make use of the conditional textStyleGAN weights. Training of our image manipulation models is performed on a single NVIDIA Titan V GPU with 12GB of RAM.

4.4. Evaluation

Following common practice in text-to-image work [23, 17, 24, 29], we report the Inception Score (IS) [20], computed by randomly sampling 30000 unseen descriptions from the test sets, and R-precision (as described in [23]) on the CUB and COCO datasets. To compare facial image quality with unconditional StyleGAN, we report Fréchet Inception Distance (FID) [6] scores. Furthermore, to compare the level of disentanglement between StyleGAN and our textStyleGAN, we report perceptual path lengths [8]. To measure semantic consistency for textStyleGAN trained on CelebTD-HQ, we report classification scores of several attributes on the generated images. We evaluate semantic facial image manipulation qualitatively.

5. Results

To be able to compare to previous work we evaluate both parts of our pipeline separately on their respective tasks. We present the results in this section.

5.1. Part 1 - textStyleGAN

Following the evaluation protocol of earlier text-to-image work on CUB and COCO [25, 26, 28, 23, 17, 24], we have calculated IS to quantify image quality. Scores are presented in Table 2. Our full text-to-image model performs best on CUB and second best on COCO, with scores of 4.78 and 33.00 respectively. Furthermore, following the StyleGAN evaluation [8], FID scores for CelebTD-HQ are listed in Table 1. The results indicate that the FID scores for our conditional model are slightly better than those of the (unconditional) StyleGAN model.

In order to compare semantic consistency to earlier work, we calculate R-precision scores for CUB and COCO,

5 https://github.com/nvlabs/stylegan

Model                         | FID (CelebTD-HQ) | Resolution
StyleGAN (unconditioned) [8]  | 5.17 ± 0.08      | 1024 × 1024
textStyleGAN base (ours)      | 5.11 ± 0.09      | 1024 × 1024
w/ CA                         | 5.11 ± 0.06      | 1024 × 1024
w/ Attention                  | 5.10 ± 0.06      | 1024 × 1024
w/ Cross-modal similarity     | 5.08 ± 0.07      | 1024 × 1024

Table 1. Fréchet Inception Distance for our text-to-image models on CelebTD-HQ.

Model                      | CUB         | COCO         | Resolution
GAN-INT-CLS [19]           | 2.88 ± 0.04 | 7.88 ± 0.07  | 64 × 64
GAWWN [18]                 | 3.62 ± 0.07 | -            | 128 × 128
StackGAN [25]              | 3.70 ± 0.04 | 8.45 ± 0.03  | 256 × 256
StackGAN++ [26]            | 3.82 ± 0.06 | -            | 256 × 256
PPGN [14]                  | -           | 9.58 ± 0.21  | 227 × 227
HDGAN [28]                 | 4.15 ± 0.05 | 11.86 ± 0.18 | 256 × 256
AttnGAN [23]               | 4.36 ± 0.03 | 25.89 ± 0.47 | 256 × 256
MirrorGAN [17]             | 4.56 ± 0.05 | 26.47 ± 0.41 | 256 × 256
SD-GAN [24]                | 4.67 ± 0.09 | 35.69 ± 0.50 | 256 × 256
DM-GAN [29]                | 4.75 ± 0.07 | 30.49 ± 0.57 | 256 × 256
textStyleGAN base (ours)   | 3.89 ± 0.04 | 14.85 ± 0.50 | 256 × 256
w/ CA                      | 4.01 ± 0.07 | 16.26 ± 0.43 | 256 × 256
w/ Attention               | 4.72 ± 0.08 | 32.34 ± 0.49 | 256 × 256
w/ Cross-modal similarity  | 4.78 ± 0.03 | 33.00 ± 0.31 | 256 × 256

Table 2. IS for various text-to-image models on the CUB and COCO datasets.

presented in Table 3. The results demonstrate that the generated images are semantically consistent with their textual descriptions. Notably, the scores improve significantly with the cross-modal similarity loss. For k = 1, CUB scores improve from 51.52 to 74.72 and COCO scores from 73.02 to 87.02. This makes sense, since this loss explicitly forces semantic consistency between images and text. Without this loss, the model learns semantic consistency implicitly, but only to a certain extent.

Dataset                    | CUB                   | COCO
top-k                      | k=1   | k=2   | k=3   | k=1   | k=2   | k=3
AttnGAN [23]               | 53.31 | 54.11 | 54.36 | 72.13 | 73.21 | 76.53
MirrorGAN [17]             | 57.67 | 58.52 | 60.42 | 74.52 | 76.87 | 80.21
DM-GAN [29]                | 72.31 | –     | –     | 88.56 | –     | –
textStyleGAN base (ours)   | 40.98 | 42.43 | 45.59 | 60.03 | 61.47 | 63.08
w/ CA                      | 45.06 | 46.50 | 48.33 | 65.84 | 67.88 | 71.59
w/ Attention               | 51.52 | 53.89 | 55.24 | 73.02 | 75.74 | 78.38
w/ Cross-modal similarity  | 74.72 | 76.08 | 79.56 | 87.02 | 87.54 | 88.23

Table 3. R-precision scores for various text-to-image models on CUB and COCO.

To determine semantic consistency for CelebTD-HQ, we take a different approach. We again use Face API to determine if attributes that are present in the textual description are also present in the generated image. The results in Table 4 demonstrate that this is indeed mostly the case.

We are interested in the degree of disentanglement of the latent space W in the case of textStyleGAN trained on CelebTD-HQ. A high degree of disentanglement is desirable because it makes manipulating the generated images easier. Therefore, we calculate perceptual path lengths and compare the scores to (unconditional) StyleGAN.


The scores are presented in Table 5, and indicate a similar level of disentanglement in both latent spaces. Our conditional model has a score of 201.1, slightly higher than the 200.5 score for StyleGAN.

Finally, see Figures 3, 4 and 5 for qualitative samples of our full textStyleGAN model on CelebTD-HQ, CUB and COCO.

Attribute    | textStyleGAN | CelebA-HQ
Bald         | 0.98         | 1.00
Black Hair   | 0.93         | 0.97
Blond Hair   | 1.00         | 1.00
Brown Hair   | 0.96         | 0.98
Eyeglasses   | 0.92         | 1.00
Gray Hair    | 0.89         | 0.99
Makeup       | 0.76         | 0.86
Male         | 1.00         | 1.00
No Beard     | 1.00         | 1.00
Smiling      | 0.87         | 0.93
Young        | 0.86         | 0.90

Table 4. Attribute classification scores for images generated with textStyleGAN and for the original CelebA-HQ dataset.

Model                        | Perceptual path length
StyleGAN (unconditional) [8] | 200.5
textStyleGAN base (ours)     | 201.4
w/ CA                        | 200.1
w/ Attention                 | 200.8
w/ Cross-modal similarity    | 201.1

Table 5. Perceptual path length scores for StyleGAN trained on FFHQ and textStyleGAN trained on CelebTD-HQ.

5.2. Part 2 - Semantic manipulation

We present qualitative examples for attribute directions in Figures 7 (smile direction), 8 (gender direction) and 9 (age direction). The results demonstrate that our method can change single attributes while holding most others constant. However, because of coupled attributes in the biased training data, in some cases this is not possible. This can be observed in Figure 8. In the top and bottom row the subject wears earrings (which were not present before manipulation) when edited to become more female. Another interesting observation is that the color of the jacket changes from blue to red in the bottom row of Figure 7.

Removing artifacts We show that our method can improve the visual quality of images generated by StyleGAN, which suffers from circular artifacts. We use our method to find a circular artifact direction in latent space, by manually classifying 250 images into artifact or no artifact. This direction can then be subtracted from images with artifacts, often resulting in an image without the artifact, as depicted in Figure 6. Most similar is work by Bau et al. [2], who identified units (defined as layers in the generator network) responsible for artifacts, which are then ablated to suppress the artifacts.

(a) The woman has her mouth slightly open and has black hair. She is smiling and young.

(b) This person has 5 o’clock shadow, wavy hair and bushy eyebrows. He has a big nose.

(c) This young man has black hair and no beard. He is smiling.

Figure 3. textStyleGAN trained on CelebTD-HQ. Different noise vectors for all images.

(a) this bird has wings that are black with yellow belly

(b) this bird has a white breast and crown with red feet and grey wings.

(c) a small, light red bird, with black primaries, throat, and eyebrows, with a short bill.

Figure 4. textStyleGAN trained on CUB. Different noise vectors for all images.


(a) A man on snow skis traveling down a hill.

(b) A large white airplane sitting on a runway.

(c) a baseball player batting up at home plate

Figure 5. textStyleGAN trained on COCO. Different noise vectors for all images.

Figure 6. Our sample classification method can also be used to remove circular artifacts. The top images are sampled from StyleGAN and suffer from circular artifacts, highlighted by a red circle. These artifacts can often be removed by subtracting the circular artifact direction.

To the best of our knowledge, there is no earlier work on removing GAN artifacts by making use of directions in the latent space. StyleGAN artifacts are obvious indications of the artificial origin of an image, and being able to remove these leads to more convincing images.

6. Conclusion

This paper shows that conditional image synthesis and semantic manipulation can be brought together and utilized for practical content creation applications. We have introduced a novel text-to-image model, textStyleGAN, that allows for semantic facial manipulation after generating an image. This facilitates manipulating conditionally generated images, hence effectively unifying conditional generation and semantic manipulation.

Figure 7. Qualitative results for sample classification method. From left to right: less joy, more joy.

Figure 8. Qualitative results for sample classification method. From left to right: more female, more male.

Figure 9. Qualitative results for sample classification method. From left to right: older, younger.

We quantitatively showed that the model is comparable to the state of the art while allowing for higher resolutions. Finally, we have introduced CelebTD-HQ, a facial dataset with full-length text descriptions based on attributes. We show that this dataset can be used to generate and manipulate facial images, and we believe applications such as facial stock photo creation are possible with our approach. For future work, the exploration of more complex attributes for semantic manipulation can be considered.

7. Acknowledgements

This research was supported by the Nationale Politie. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.


References

[1] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org, 2015.

[2] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. arXiv preprint arXiv:1811.10597, 2018.

[3] Ying-Cong Chen, Huaijia Lin, Michelle Shu, Ruiyu Li, Xin Tao, Xiaoyong Shen, Yangang Ye, and Jiaya Jia. Facelet-bank for fast portrait manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3541–3549, 2018.

[4] Ying-Cong Chen, Xiaohui Shen, Zhe Lin, Xin Lu, I Pao, Jiaya Jia, and others. Semantic Component Decomposition for Face Attribute Manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9859–9867, 2019.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[6] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[7] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2018.

[8] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv preprint arXiv:1812.04948, 2018.

[9] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[10] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[11] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[12] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and others. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

[13] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[14] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4467–4477, 2017.

[15] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M. Alvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.

[16] Shengju Qian, Kwan-Yee Lin, Wayne Wu, Yangxiaokang Liu, Quan Wang, Fumin Shen, Chen Qian, and Ran He. Make a Face: Towards Arbitrary High Fidelity Face Manipulation. arXiv preprint arXiv:1908.07191, 2019.

[17] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning Text-to-Image Generation by Redescription. arXiv preprint arXiv:1903.05854, 2019.

[18] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.

[19] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[21] Wei Shen and Rujie Liu. Learning residual images for face attribute manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4030–4038, 2017.

[22] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset, 2011.

[23] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.

[24] Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang Wang, and Jing Shao. Semantics Disentangling for Text-to-Image Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2327–2336, 2019.

[25] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.

[26] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.

[27] Ying Zhang and Huchuan Lu. Deep Cross-Modal Projection Learning for Image-Text Matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686–701, 2018.

[28] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[29] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.

Appendix A. CelebTD-HQ dataset

Rule                                  | Probability
S → NP VP                             | 1
NP → Det Gender                       | 0.5
NP → PN                               | 0.5
VP → Wearing PN Are PN HaveWith       | 0.166
VP → Wearing PN HaveWith PN Are       | 0.166
VP → Are PN Wearing PN HaveWith       | 0.166
VP → Are PN HaveWith PN Wearing       | 0.166
VP → HaveWith PN Wearing PN Are       | 0.166
VP → HaveWith PN Are PN Wearing       | 0.166
Wearing → WearVerb WearAttributes     | 1
Are → IsVerb IsAttributes             | 1
HaveWith → HaveVerb HaveAttributes    | 1
Det → a                               | 0.5
Det → this                            | 0.5
Gender → gender                       | 0.75
Gender → person                       | 0.25
PN → pn                               | 1
WearVerb → wears                      | 0.5
WearVerb → is wearing                 | 0.5
WearAttributes → wears                | 1
IsVerb → is                           | 1
IsAttributes → is                     | 1
HaveVerb → has                        | 1
HaveAttributes → HaveWith             | 1

Figure 10. PCFG used to generate captions for our CelebTD-HQ dataset.

See Figure 10 for the PCFG used to generate captions for our CelebTD-HQ dataset. Note that some terminal symbols have bold values, which can be thought of as an attribute list where either a single option is picked (e.g. male / female / man / woman in the case of Gender) or a list of attributes in the case of WearAttributes (e.g., glasses, hat), IsAttributes (e.g. smiling) and HaveAttributes (e.g. blonde hair). To promote diversity of the generated sentences, we choose equal probabilities when multiple options are available. (Except for the case of gender, where we mention the gender explicitly more often than not.) We only consider the active binary attributes, and ignore the inactive ones. A total of n attributes per description is randomly selected, where n ∼ N(5, 1) is rounded to the nearest integer. This prevents the description from being an exhaustive list of attributes, which would not sound natural.
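As an illustration of how such a grammar can be sampled, a simplified Python sketch follows; this is not the exact grammar of Figure 10, and the non-terminals, attribute lists and probabilities below are reduced, hypothetical examples:

```python
import random

# Simplified PCFG: each non-terminal maps to (expansion, probability) pairs.
GRAMMAR = {
    "S": [(["NP", "VP"], 1.0)],
    "NP": [(["Det", "Gender"], 1.0)],
    "Det": [(["this"], 0.5), (["a"], 0.5)],
    "Gender": [(["man"], 0.375), (["woman"], 0.375), (["person"], 0.25)],
    "VP": [(["HaveWith", "and", "Are"], 0.5), (["Are", "and", "HaveWith"], 0.5)],
    "HaveWith": [(["has", "HaveAttr"], 1.0)],
    "Are": [(["is", "IsAttr"], 1.0)],
    "HaveAttr": [(["black hair"], 0.5), (["gray hair"], 0.5)],
    "IsAttr": [(["smiling"], 0.5), (["young"], 0.5)],
}

def expand(symbol):
    """Recursively expand a symbol; strings not in GRAMMAR are terminals."""
    if symbol not in GRAMMAR:
        return [symbol]
    expansions, probs = zip(*GRAMMAR[symbol])
    choice = random.choices(expansions, weights=probs, k=1)[0]
    return [token for s in choice for token in expand(s)]

print(" ".join(expand("S")) + ".")  # e.g. "this woman has black hair and is smiling."
```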
