
ImaGAN: Unsupervised Training of Conditional Joint CycleGAN for Transferring Style with Core Structures in Content Preserved

Kang Min Bae [0000-0002-1285-8384], Minuk Ma [0000-0003-0416-8479], Hyunjun Jang [0000-0002-5341-6697], Minjeong Ju [0000-0002-9639-2090], Hyoungwoo Park [0000-0002-4633-1118], and Chang D. Yoo

KAIST, Daejeon, South Korea
{kmbae94, akalsdnr, wiseholi, jujeong94, hyoungwoo.park, cd yoo}@kaist.ac.kr

Abstract. This paper considers conditional image generation that merges the structure of one object with the style of another; in short, the style of an image is substituted with the style of another image. An architecture for extracting the structure of one image and merging the extracted structure with the style of another image is considered. The proposed ImaGAN architecture consists of two networks: (1) a style removal network R that removes style information and leaves only the edge information, and (2) a generation network G that fills the extracted structure with the style of another image. This architecture allows various pairings of style and structure that would not have been possible with image-to-image translation methods, and it incurs minimal classification error compared to prior style transfer and domain transfer algorithms. Experimental results on the edges2handbags and edges2shoes datasets reveal that ImaGAN can transfer the style of one image to another without distorting the target structure.

Keywords: Image-to-image translation · Style transfer · CycleGAN · Conditional generation · Deep learning.

1 Introduction

We often wonder what it would look like to merge the structure of one object with the style of another. For instance, merging the structure of a yellow rain boot with the style of a leather bag would create a leather rain boot that we could otherwise only imagine. Assume elements from two domains, x_{A_1} ∈ A_1 and x_{A_2} ∈ A_2, as depicted in Fig. 1, where the elements of each domain share structural similarities but can have different styles. The main task is to transfer the style of x_{A_1} to x_{A_2} with the core structures of x_{A_2} preserved. In general, transferring the style while preserving the structure is a challenging problem owing to the difficulty in defining the concepts of structure and style. There are two related research areas: (1) style transfer and (2) image-to-image translation.


Fig. 1. Main task: given two inputs, a content image and a style image, the task is to transfer the style from the style image and apply it to the content image while preserving the structure.

Style transfer considers transferring the style of a source image to a target image. Gatys et al. [7] applied convolutional neural networks (CNNs) to neural style transfer. Following this work, there have been numerous efforts [10, 3, 17, 12, 9, 24, 19, 6] to make this transfer as fast and seamless as possible. Huang et al. [10] proposed a straightforward method for style transfer. These methods generally do not preserve fine-grained structural information and lose edge information.

Aside from style transfer, there has been considerable interest in image-to-image translation [2, 11, 29, 14, 1, 13, 15]. Isola et al. [11] adopted conditional adversarial networks for image-to-image translation. Zhu et al. [29] proposed cycle-consistent adversarial networks for unpaired image-to-image translation. Kim et al. [15] proposed DiscoGAN, which jointly trains two generative adversarial networks (GANs) to learn cross-domain relations. The majority of previous works require two domains: a domain of stylized images and a domain of un-stylized images. However, gathering such a dataset is often costly.

To overcome the aforementioned challenges, this paper proposes ImaGAN. ImaGAN can be viewed as a conditional version of CycleGAN [29]. Similar to CycleGAN [29], two GANs are jointly trained with cycle-consistency constraints. In contrast to CycleGAN [29], however, ImaGAN allows the user to select the structure and the style of the generated image. To this end, ImaGAN is composed of two parts: (1) a style removal network R that removes the style of an image such that only the edge information is left, and (2) a generation network G that fills in the style of the image whose style has been removed by the style removal network. Experiments on the edges2handbags and edges2shoes datasets revealed that ImaGAN can transfer the style of one image to another without distorting the target structure.

The rest of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the detailed formulation of the task and the proposed architecture. Section 4 covers the implementation, discusses the experimental results, and compares them with other algorithms. In Section 5, a summary and suggestions for future work are given.


2 Related Works

2.1 Artistic Style Transfer

The seminal work of Gatys et al. [7] applied convolutional neural networks to artistic style transfer. Motivated by this work, various methods were suggested to improve the original neural style transfer algorithm in terms of speed, transfer effect, and controllability. Many initial methods were optimization-based [7, 17, 18, 25]. Johnson et al. [12] introduced a feed-forward neural network-based method that is more than three orders of magnitude faster than optimization-based methods while achieving a similar quality of transfer results. Huang et al. [10] proposed adaptive instance normalization (AdaIN), which directly transforms the statistics of the content feature maps using the mean and standard deviation of the feature maps of the style image. Chen et al. [4] proposed a zero-shot style transfer method that swaps patches of the content feature maps with the nearest style features. Recently, Shen et al. [22] adopted meta networks that generate the filters of convolutional neural networks for arbitrary style transfer. Although the task these algorithms attempt to solve is highly related to the task this paper explores, directly applying those methods is not appropriate due to the structural distortion they introduce for artistic effect.

2.2 Image-to-Image Translation

Isola et al. [11] proposed the pix2pix architecture for translating an image of one domain into an image of another domain. The model can convert an edge image of a bag into a real bag image, and a segmentation map into a real photo. Zhu et al. [29] proposed CycleGAN, a GAN with cycle consistency that eliminated the need for a paired image dataset. Kim et al. [15] proposed DiscoGAN, which learns cross-domain relations using cycle consistency.

Numerous methods were proposed to tackle image-to-image translation with task-specific constraints. Kaur et al. [14] proposed FaceTex, which transfers the texture of a source face image to a target face while preserving the identity of the target. Chang et al. [2] proposed PairedCycleGAN, a method to apply the makeup style of one face image to another, which assumed that make-up can be completely disentangled from face identity information. Recently, Joo et al. [13] proposed a GAN-based method that generates a fusion image with one person's identity and another's pose, devising an identity loss for identity preservation and a Min-Patch training method for better optimization. Xian et al. [26] proposed TextureGAN to apply a given texture to raw edge images using a local texture loss. These previous works assume the existence of two domains: one with stylized images and the other with un-stylized images. However, gathering such a dataset is often prohibitively expensive. The proposed method is not limited to such a domain setting.


Fig. 2. Main architecture: the style removal network R generates an edge image from the input x_{A_1}. The generation network G generates the new-style image x_{A_1^G} using the reference style image x_{A_2}. Since cycle consistency has to be preserved, R generates an edge image from x_{A_1^G} and G tries to reconstruct the original x_{A_1} using R(x_{A_1^G}) and x_{A_1}. There are four different discriminators, one for each domain; the third and fourth cycles use different discriminators than the first and second cycles.

3 Proposed Method

3.1 Overview

Suppose there are two domains A_* and B_*. The elements of domain A_* share similar structures (all bags look alike, and all shoes look alike), while they may have very different colors and textures. The domain B_* is the set of edge images for A_*. Any element sampled from A_* is denoted as x_{A_*} ∈ R^{w×h×c}, where w, h, c denote width, height, and channel; for each x_{A_1} there exists a corresponding edge image x_{B_1} ∈ B_1 ⊂ R^{w×h×c}. Namely, the elements of domain B_1 contain only the structure information. Assume that A_2 and B_2 are defined similarly.


Sampled elements from the four domains are shown in Fig. 3. It is assumed that disentangling content and style information is possible.

Under this setting, the task is to generate an image x_{A_1^G} which has the same structure as x_{A_1} and the style of x_{A_2}. To achieve this, two functions R and G are defined, both implemented as encoder-decoder type deep neural networks. Given x_{A_1} ∈ A_1, network R removes all style information and outputs R(x_{A_1}). With R(x_{A_1}) as the structure information and x_{A_2} as the style image, network G generates the new stylized image x_{A_1^G}. The generated image x_{A_1^G} is again processed by R to extract its structural information, and then by G together with x_{A_1} to recover the original image x_{A_1}. This stage is applied in a similar way to x_{A_1^G}, x_{A_2}, and x_{A_2^G}, as depicted in Fig. 2. For the generated images in the intermediate stages, domain classifiers with GAN-type losses are added for realistic image generation. To be specific, the images x_{A_1}, x_{B_1}, x_{A_2}, and x_{B_2} can denote a shoe image, the contour of x_{A_1}, a handbag image, and the contour of x_{A_2}, respectively.

Fig. 3. Sampled images from sets A_1, A_2, B_1, and B_2.

3.2 Style Removal Network

To extract structure information, the style removal network R removes the style of a given input image and leaves only the structural information. Without removing the style of the input image, generation produces unpredictable results; an ablation study for the style removal network is provided in the experiment section. One straightforward alternative would be to use knowledge from neural style transfer algorithms such as AdaIN [10]. However, as already mentioned, neural style transfer algorithms distort the target structure. To overcome this challenge, network R removes the style from the input so that the generator cannot refer to the style information of the content image. When x_{A_1} is processed by R, only structure information is left in the form of a contour, and the result R(x_{A_1}) should ideally be identical to x_{B_1}. Network R is trained to minimize the loss L_R = L_{B_1} + L_{B_2} + L_{G_R}, where L_{B_1}, L_{B_2}, and L_{G_R} are defined as:

L_{B_1} = E[ ||x_{B_1} - R(x_{A_1})||_2^2 ],
L_{B_2} = E[ ||x_{B_2} - R(x_{A_2})||_2^2 ],
L_{G_R} = E[ log(D_{B_1}(R(x_{A_1}))) + log(D_{B_2}(R(x_{A_2}))) ].   (1)


However, the L_{B_1} and L_{B_2} losses alone are not sufficient for training the style removal network. For realistic conditional image generation, a GAN [8] type loss is also adopted. The discriminators D_{B_1} and D_{B_2} try to determine whether R(x_{A_1}) and R(x_{A_2}) are in domains B_1 and B_2, respectively, while R tries to fool the discriminators. Therefore, the total loss of the style removal network is

L_R = L_{B_1} + L_{B_2} + L_{G_R}.   (2)
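
For concreteness, the style removal objective of Eqs. (1)-(2) can be written as a short PyTorch sketch. This is only an illustration under assumptions: the module names (R, d_b1, d_b2) are hypothetical, and the adversarial term is negated so that minimizing the loss corresponds to fooling the discriminators, as stated in the text.

import torch
import torch.nn.functional as F

def style_removal_loss(R, d_b1, d_b2, x_a1, x_b1, x_a2, x_b2, eps=1e-8):
    # Sketch of L_R = L_B1 + L_B2 + L_GR (Eqs. 1-2); all names are assumptions.
    r_a1, r_a2 = R(x_a1), R(x_a2)          # predicted edge images
    l_b1 = F.mse_loss(r_a1, x_b1)          # L_B1: match the ground-truth edges x_B1
    l_b2 = F.mse_loss(r_a2, x_b2)          # L_B2
    # Adversarial term, negated so that minimizing pushes R to fool D_B1 and D_B2.
    l_gr = -(torch.log(d_b1(r_a1) + eps).mean() + torch.log(d_b2(r_a2) + eps).mean())
    return l_b1 + l_b2 + l_gr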

3.3 Generation Network

The generation network G generates a new style of object from a style-removed image and a style-reference image. Unlike the style removal network, the generation network takes two inputs, R(x_{A_1}) and x_{A_2} (or R(x_{A_2}) and x_{A_1}). The generation network extracts the style from the reference style image x_{A_2} and applies it to R(x_{A_1}). A simple concatenation scheme is used to accept the two inputs at the same time. A cycle consistency loss [29] is used to enforce that the final image matches the original image; this part of the architecture gives G the ability to stylize the edge image using the reference image. Cycle consistency losses for the generated images are also calculated to compensate for the non-existence of ground-truth labels for those images:

L_{cyc} = E[ ||G(R(x_{A_1^G}), x_{A_1}) - x_{A_1}||_2^2 ] + E[ ||G(R(x_{A_1^{GG}}), x_{A_2}) - x_{A_1^G}||_2^2 ]
        + E[ ||G(R(x_{A_2^G}), x_{A_2}) - x_{A_2}||_2^2 ] + E[ ||G(R(x_{A_2^{GG}}), x_{A_1}) - x_{A_2^G}||_2^2 ],   (3)

where

x_{A_1^G} = G(R(x_{A_1}), x_{A_2}),      x_{A_2^G} = G(R(x_{A_2}), x_{A_1}),
x_{A_1^{GG}} = G(R(x_{A_1^G}), x_{A_2}),  x_{A_2^{GG}} = G(R(x_{A_2^G}), x_{A_1}).

To constrain the structure of the input image, a mean squared error (MSE) loss between edge images is used. This structure loss preserves the structure while reconstructing back to the original image by minimizing

L_{str} = E[ ||R(x_{A_1}) - R(x_{A_1^G})||_2^2 + ||R(x_{A_1^{GG}}) - R(x_{A_1^G})||_2^2 ]
        + E[ ||R(x_{A_2}) - R(x_{A_2^G})||_2^2 + ||R(x_{A_2^{GG}}) - R(x_{A_2^G})||_2^2 ].   (4)

To generate more realistic images, a GAN-type loss is used. Assume there are two discriminators that determine whether inputs are in domains A_1 and A_2. The generation loss of the generation network can then be expressed as:

L_{G_G} = E[ log(D_{A_1}(x_{A_1^G})) + log(D_{A_2}(x_{A_2^G})) ]
        + E[ log(D_{A_1}(x_{A_1^{GG}})) + log(D_{A_2}(x_{A_2^{GG}})) ],   (5)

where the generation network tries to generate images that look similar to the elements of domains A_1 and A_2.


The reason the generation network takes the output of the style removal network, R(x_{A_*}), rather than the content image x_{A_*} directly, is to avoid reconstructing the same input. When x_{A_1} and x_{A_2} are given to the generation network, the generation network simply outputs x_{A_1} without changing any style. This is due to the cycle consistency loss: without the style removal network R, L_{cyc} becomes 0 when G(a, b) = b, where a, b ∈ R^{w×h×c}. To avoid this unintended result, the style removal network was introduced; the whole generation process cannot be carried out without it. Finally, the total loss of the generation network is the sum of the cycle consistency loss, the structure loss, and the generation loss:

L_G = L_{cyc} + L_{str} + L_{G_G}.   (6)
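
As with the style removal loss, the generator objective L_G = L_{cyc} + L_{str} + L_{G_G} of Eqs. (3)-(6) can be sketched in PyTorch. This is an assumption-laden illustration: the module and variable names are hypothetical, and the adversarial term is written in negated form so that minimizing it fools D_{A_1} and D_{A_2}.

import torch
import torch.nn.functional as F

def generator_loss(R, G, d_a1, d_a2, x_a1, x_a2, eps=1e-8):
    # Sketch of L_G = L_cyc + L_str + L_GG (Eqs. 3-6); names are assumptions.
    x_a1g  = G(R(x_a1),  x_a2)     # structure of x_A1, style of x_A2
    x_a2g  = G(R(x_a2),  x_a1)     # structure of x_A2, style of x_A1
    x_a1gg = G(R(x_a1g), x_a2)     # second-cycle generations
    x_a2gg = G(R(x_a2g), x_a1)
    # Cycle consistency (Eq. 3): recover the originals and the first-cycle outputs.
    l_cyc = (F.mse_loss(G(R(x_a1g),  x_a1), x_a1)  + F.mse_loss(G(R(x_a2g),  x_a2), x_a2)
           + F.mse_loss(G(R(x_a1gg), x_a2), x_a1g) + F.mse_loss(G(R(x_a2gg), x_a1), x_a2g))
    # Structure preservation between edge maps (Eq. 4).
    l_str = (F.mse_loss(R(x_a1g), R(x_a1)) + F.mse_loss(R(x_a1gg), R(x_a1g))
           + F.mse_loss(R(x_a2g), R(x_a2)) + F.mse_loss(R(x_a2gg), R(x_a2g)))
    # Adversarial term (Eq. 5), negated so that minimizing fools the discriminators.
    l_gg = -(torch.log(d_a1(x_a1g)  + eps).mean() + torch.log(d_a2(x_a2g)  + eps).mean()
           + torch.log(d_a1(x_a1gg) + eps).mean() + torch.log(d_a2(x_a2gg) + eps).mean())
    return l_cyc + l_str + l_gg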

3.4 Discriminator Network

An adversarial loss for the four discriminator networks D_{B_1}, D_{B_2}, D_{A_1}, and D_{A_2} is used. Let D_* denote a discriminator that judges whether its input is in domain *. The loss for each discriminator is:

L_{D_{B_1}} = E[ log(D_{B_1}(x_{B_1})) + log(1 - D_{B_1}(R(x_{A_1}))) ],   (7)
L_{D_{B_2}} = E[ log(D_{B_2}(x_{B_2})) + log(1 - D_{B_2}(R(x_{A_2}))) ],   (8)
L_{D_{A_1}} = E[ log(D_{A_1}(x_{A_1})) + log(1 - D_{A_1}(x_{A_1^G})) + log(1 - D_{A_1}(x_{A_1^{GG}})) ],   (9)
L_{D_{A_2}} = E[ log(D_{A_2}(x_{A_2})) + log(1 - D_{A_2}(x_{A_2^G})) + log(1 - D_{A_2}(x_{A_2^{GG}})) ].   (10)

The discriminator D_{A_1} accepts the real image x_{A_1} and the fake image x_{A_1^G}, and tries to determine whether they are elements of domain A_1. The other discriminators D_{A_2}, D_{B_1}, and D_{B_2} operate under similar principles, but with the appropriate real and fake images for domains A_2, B_1, and B_2, respectively. Since x_{A_*^{GG}} is a function of R(x_{A_*}), training D_{B_1} and D_{B_2} comes ahead of training D_{A_1} and D_{A_2}. The total loss of the discriminator networks is the sum of the four discriminator losses:

L_D = L_{D_B} + L_{D_A},   (11)

where

L_{D_B} = L_{D_{B_1}} + L_{D_{B_2}},   (12)
L_{D_A} = L_{D_{A_1}} + L_{D_{A_2}}.   (13)
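
The four discriminator objectives of Eqs. (7)-(13) can be sketched in the same way. In the standard GAN convention the discriminators maximize these log-likelihood terms, so the sketch below returns their negatives as the quantities to minimize; all module names are assumptions, and the generated samples are detached so that only the discriminators receive gradients.

import torch

def discriminator_losses(R, G, d_a1, d_a2, d_b1, d_b2, x_a1, x_b1, x_a2, x_b2, eps=1e-8):
    # Sketch of L_DB = L_DB1 + L_DB2 and L_DA = L_DA1 + L_DA2 (Eqs. 7-13); names are assumptions.
    with torch.no_grad():                  # fake samples: no gradients flow back to R or G
        r_a1, r_a2 = R(x_a1), R(x_a2)
        x_a1g  = G(r_a1, x_a2)
        x_a2g  = G(r_a2, x_a1)
        x_a1gg = G(R(x_a1g), x_a2)
        x_a2gg = G(R(x_a2g), x_a1)
    def log(t):
        return torch.log(t + eps).mean()
    # Edge-domain discriminators (Eqs. 7-8, 12).
    l_db = (log(d_b1(x_b1)) + log(1 - d_b1(r_a1))
          + log(d_b2(x_b2)) + log(1 - d_b2(r_a2)))
    # Image-domain discriminators (Eqs. 9-10, 13).
    l_da = (log(d_a1(x_a1)) + log(1 - d_a1(x_a1g)) + log(1 - d_a1(x_a1gg))
          + log(d_a2(x_a2)) + log(1 - d_a2(x_a2g)) + log(1 - d_a2(x_a2gg)))
    return -l_db, -l_da                    # negated: maximizing the log-likelihood = minimizing its negative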

3.5 Training Networks

The goal of the whole network is to minimize the total loss, expressed as

L = L_D + L_R + L_G.   (14)

Minimizing the total loss L all at once is a very difficult task. Therefore, a four-step training scheme was devised to orchestrate all the sub-networks.


Algorithm 1 Training Algorithm for ImaGAN

Maximum number of iterations = N
Initialize the whole network

1: for t in 1:N do
2:   Sample k images with edge images from A_1 × B_1: {x^i_{A_1}, x^i_{B_1}}, i = 1, ..., k
3:   Sample k images with edge images from A_2 × B_2: {x^i_{A_2}, x^i_{B_2}}, i = 1, ..., k
4:   Generate R(x^i_{A_1}) and R(x^i_{A_2}) for i = 1, ..., k
5:   Update D_{B_1} and D_{B_2} so as to minimize L_{D_B}:
        L_{D_B} = Σ_{i=1}^{k} Σ_{j=1}^{2} log(D_{B_j}(x^i_{B_j})) + log(1 - D_{B_j}(R(x^i_{A_j})))
6:   Generate x^i_{A_1^G}, x^i_{A_2^G}, x^i_{A_1^{GG}}, and x^i_{A_2^{GG}}
7:   Update D_{A_1} and D_{A_2} so as to minimize L_{D_A}:
        L_{D_A} = Σ_{i=1}^{k} Σ_{j=1}^{2} log(D_{A_j}(x^i_{A_j})) + log(1 - D_{A_j}(x^i_{A_j^G})) + log(1 - D_{A_j}(x^i_{A_j^{GG}}))
8:   Update the R network so as to minimize L_R:
        L_R = Σ_{i=1}^{k} Σ_{j=1}^{2} ||x^i_{B_j} - R(x^i_{A_j})||_2^2 + log(D_{B_j}(R(x^i_{A_j})))
9:   Generate G(R(x^i_{A_1^{GG}}), x^i_{A_2}) and G(R(x^i_{A_2^{GG}}), x^i_{A_1})
10:  Update the G network so as to minimize L_G = L_{cyc} + L_{str} + L_{G_G}:
        L_{cyc} = Σ_{i=1}^{k} ||G(R(x^i_{A_1^G}), x^i_{A_1}) - x^i_{A_1}||_2^2 + ||G(R(x^i_{A_2^G}), x^i_{A_2}) - x^i_{A_2}||_2^2
                  + ||G(R(x^i_{A_1^{GG}}), x^i_{A_2}) - x^i_{A_1^G}||_2^2 + ||G(R(x^i_{A_2^{GG}}), x^i_{A_1}) - x^i_{A_2^G}||_2^2
        L_{str} = Σ_{i=1}^{k} Σ_{j=1}^{2} ||R(x^i_{A_j}) - R(x^i_{A_j^G})||_2^2 + ||R(x^i_{A_j^{GG}}) - R(x^i_{A_j^G})||_2^2
        L_{G_G} = Σ_{i=1}^{k} Σ_{j=1}^{2} log(D_{A_j}(x^i_{A_j^G})) + log(D_{A_j}(x^i_{A_j^{GG}}))

The goal of the discriminators is to discriminate the results from the style removal network and the generation network, while the goal of these generation functions is to fool all discriminators at once. Training D_{B_1} and D_{B_2} begins the training process; the goal of this step is to minimize L_{D_B} so that D_{B_1} and D_{B_2} are trained at the same time. One caution in the course of training is to block the gradient flow so as not to update irrelevant networks. For example, when calculating the cycle consistency loss for x_{A_1^G}, as depicted in the second row of Fig. 2, the network R that was used to generate x_{A_1^G} in the first row is not updated.


Using the given losses, the whole training process is shown in Algorithm 1. A mini-batch setting was used in the practical implementation, where k denotes the size of the mini-batch.
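
Putting the pieces together, the four-step schedule of Algorithm 1 can be sketched as the loop below, reusing the loss sketches given earlier. The data loaders, networks, and the four Adam optimizers (opt_db, opt_da, opt_r, opt_g) are assumptions; relying on zero_grad() before each step is one simple way to realize the gradient blocking described above, since each optimizer only updates its own sub-network.

# Hypothetical setup: R, G, d_a1, d_a2, d_b1, d_b2 are the sub-networks; loader_1 and loader_2
# yield (image, edge) batches from A1 x B1 and A2 x B2; opt_* are Adam optimizers.
for step in range(num_iterations):
    x_a1, x_b1 = next(loader_1)
    x_a2, x_b2 = next(loader_2)

    # Step 1: update the edge-domain discriminators D_B1, D_B2.
    l_db, _ = discriminator_losses(R, G, d_a1, d_a2, d_b1, d_b2, x_a1, x_b1, x_a2, x_b2)
    opt_db.zero_grad(); l_db.backward(); opt_db.step()

    # Step 2: update the image-domain discriminators D_A1, D_A2.
    _, l_da = discriminator_losses(R, G, d_a1, d_a2, d_b1, d_b2, x_a1, x_b1, x_a2, x_b2)
    opt_da.zero_grad(); l_da.backward(); opt_da.step()

    # Step 3: update the style removal network R.
    l_r = style_removal_loss(R, d_b1, d_b2, x_a1, x_b1, x_a2, x_b2)
    opt_r.zero_grad(); l_r.backward(); opt_r.step()

    # Step 4: update the generation network G with the cycle, structure, and adversarial terms.
    # Only opt_g steps here; stale gradients on the other networks are cleared by their zero_grad().
    l_g = generator_loss(R, G, d_a1, d_a2, x_a1, x_a2)
    opt_g.zero_grad(); l_g.backward(); opt_g.step()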

4 Experiments

4.1 Dataset

In this paper, the edges2shoes [27] and edges2handbags [28] datasets were used. These datasets were originally established for colorization when only edge images are given. In this paper, however, colored images were used as input images, and edge images were used as ground-truth labels to train the style removal network. The edges2shoes dataset consists of 50k shoe catalog images collected from Zappos.com. The images are divided into 4 major categories: shoes, sandals, slippers, and boots. The shoes are centered on a white background and pictured in the same orientation for convenient analysis. The edges2handbags dataset, consisting of 137k images collected from Amazon, likewise places handbags at the center of a white background, facing the same direction. Finally, at test time, a pair of shoe and handbag images is fed into the system to check whether the style is transferred and the structure is preserved.

Fig. 4. For all networks, a tuple of integers (a, b) denotes an output channel size of a and a filter size of b. Top: architecture of the style removal network. Middle: the generation network; the circle with a cross inside denotes concatenation of the two inputs along the channel dimension. Bottom: discriminator networks for inputs in domains A and B.

4.2 Network Architecture

For the style removal network and the generation network, encoder-decoder architectures were used in which the input is downsized to 1/16 of its original size.


For the generation network, a naive deep CNN architecture cannot accept multiple inputs; therefore, the two inputs are concatenated over the channel dimension. To generate images whose pixels have bounded values, a sigmoid function is used at the output layer. The strides were two for all convolution layers. For all generators and discriminators, leaky ReLU with α = 0.2 was used as the activation function in the encoding process; in the decoding process, ReLU was used.
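
To make this description concrete, a minimal PyTorch sketch of such an encoder-decoder generator is given below. The exact channel sizes, the layer count, and the reading of "1/16" as a per-side spatial reduction are assumptions; Fig. 4 specifies the actual configuration. The style removal network R would follow the same pattern with a single 3-channel input.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Sketch of the generation network G of Sec. 4.2; layer sizes are assumptions, not Fig. 4.
    def __init__(self, in_ch=6, base=64):
        super().__init__()
        def enc(cin, cout):   # stride-2 convolution with leaky ReLU (alpha = 0.2) for encoding
            return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        def dec(cin, cout, act=None):  # stride-2 transposed convolution with ReLU for decoding
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                                 act if act is not None else nn.ReLU())
        # Four stride-2 stages shrink a 64x64 input to 4x4 (1/16 of each spatial dimension).
        self.encoder = nn.Sequential(enc(in_ch, base), enc(base, base * 2),
                                     enc(base * 2, base * 4), enc(base * 4, base * 8))
        # Sigmoid at the output layer keeps pixel values bounded in [0, 1].
        self.decoder = nn.Sequential(dec(base * 8, base * 4), dec(base * 4, base * 2),
                                     dec(base * 2, base), dec(base, 3, act=nn.Sigmoid()))

    def forward(self, edge, style):
        # The edge image and the style image are concatenated along the channel dimension.
        return self.decoder(self.encoder(torch.cat([edge, style], dim=1)))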

4.3 Implementation Detail

The implementation uses the PyTorch [21] framework. For optimizing each network, individual Adam optimizers [16] were used. For the D_{B_1} and D_{B_2} networks, the initial learning rate was set to 0.0001, with β_1 = 0.5, β_2 = 0.999, and a weight decay parameter of 0.0001. The same setting was used for the D_{A_1} and D_{A_2} networks. For the style removal network and the generation network, Adam optimizers with initial learning rate 0.0001, β_1 = 0.5, and β_2 = 0.999 were used without weight decay. To obtain reasonable output, at least 10k iterations were required. All input images from domains A and B were resized to 64 × 64, and the batch size was set to 200. When the numbers of sampled data from the two domains differed, simple resampling was performed. The edges2handbags and edges2shoes datasets contain 3-channel edge images rather than gray-scale images; therefore, 3-channel edge images were used at training time. A Titan X GPU was used to train the proposed network; with this GPU, it took about 8 to 9 hours to reach 10k iterations.

4.4 Qualitative Results

Using the proposed network, new styles of shoes and handbags were generated. First, shoe images were used as content images and handbags as style images to generate new-style shoes. Fig. 6 shows the bidirectional generation. The proposed ImaGAN architecture generated images with the core structures preserved. For example, on the left part of the first row of Fig. 2, a black slip-on shoe was generated from the brown slip-on shoe and the black handbag; furthermore, a brown handbag was generated jointly.

Fig. 5 shows the result of comparing our method with three other methods: AdaIN [10], DiscoGAN [15], and CycleGAN [29]. Compared to our method, AdaIN tends to generate images that look more artistic. Although AdaIN sometimes also generates reasonable images, the method is unstable compared to ours; for instance, in the second, seventh, and eighth rows, AdaIN colorizes the background and blurs the output shoe image. DiscoGAN and CycleGAN also generate visually acceptable shoe images, but controlling the shape of the generated shoes is impossible, so it is hard to obtain a desired shoe shape.


Fig. 5. Columns (left to right): content image, style image, ours, AdaIN [10]; then style image, DiscoGAN [15], CycleGAN [29]. The "Ours" column shows the results of the proposed method, which takes both a content and a style image and generates a new-style shoe image; AdaIN [10] can also take both. DiscoGAN and CycleGAN take only one input, so their results have arbitrary shoe shapes, while our results preserve the exact shoe shape.


4.5 Generating Bi-directionally

The proposed ImaGAN network can accept two different content domains using the same network parameters. Fig. 6 shows the bi-directional generation results for domains A and B. By using four different cycles, the proposed network generates results bi-directionally with identical network parameters.

Fig. 6. Experimental results for generating conditional shoes and handbags, compared with AdaIN [10]. For the same two input images, the first and third column groups show the inputs with their order swapped; the second and fourth column groups show the bi-directional results of our method and AdaIN [10] for shoes and handbags, respectively.

4.6 Classification Score

It is difficult to quantitatively evaluate the performance of the proposed method. To obtain objectivity and reliability simultaneously, top-1 and top-5 classification accuracy was used as the quantitative measure. As shown in Table 1, the high classification accuracy on the generated images verifies that the images generated by the proposed method look reasonable.

Table 1. Top-1 and top-5 classification accuracy by an ILSVRC 2012 [5] pre-trained Inception V3 [23].

        Original dataset   Ours    AdaIN [10]   CycleGAN [29]   DiscoGAN [15]
Top-1   62.0%              54.0%   37.5%        49.0%           51.0%
Top-5   89.0%              80.5%   65.5%        79.5%           77.0%
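
A possible way to compute such scores with torchvision is sketched below. How the ground-truth ImageNet class indices for the generated shoes and handbags are chosen is not specified here, so target_classes, as well as the generated image batch itself, are assumptions.

import torch
from torchvision import models, transforms

# Classify generated 64x64 images with an ImageNet-pretrained Inception V3 (assumed inputs).
net = models.inception_v3(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize(299),                                   # Inception V3 expects 299x299 inputs
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    logits = net(preprocess(generated))                       # generated: (N, 3, 64, 64) in [0, 1]
    top5 = logits.topk(5, dim=1).indices                      # top-5 predicted ImageNet classes
top1_acc = (top5[:, 0] == target_classes).float().mean()      # target_classes: (N,) assumed labels
top5_acc = (top5 == target_classes.unsqueeze(1)).any(dim=1).float().mean()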


4.7 Conditional Generating from One Content Image to Many

The proposed network can generate multiple stylized images for a single content image by switching among various style images. The cycle-consistency loss fused with the other losses makes it possible to tackle this new type of problem. Different from CycleGAN [29], the proposed model can generate multiple images by varying the style image, as in Fig. 7. The conditional one-to-many generation ability of ImaGAN originates from the novel four-stream operation that shares the parameters of R and G; this multi-tasking objective of R and G promotes stable learning.


Fig. 7. With a single content image, multiple images can be generated with different style images. The network not only applies handbag styles to shoe contents but also applies shoe styles to handbag contents with the same parameters.

4.8 Ablation Study

One of the key novelties of our method is the use of the edge image to preserve structure information. Without the edge extraction network R, the generator network G prefers copying the input to creating a new, stylized content image, as depicted in Fig. 2.


The network R eliminates the memory of the style image and lets G take its input and output from different domains, prohibiting simple copying of the input.

Fig. 8. Results of the ablation study. Each column group shows the inputs from A1 and A2 followed by the images generated with and without the style removal network (w/ R, w/o R).

5 Conclusions

This paper considered the problem of conditional image-to-image translation with structural constraints. Previous neural style transfer methods were not suitable for this task due to their inherent artistic rendering effect, which destroys fine-grained structural information of the content image, and only a few recent image-to-image translation methods have considered similar tasks. In this work, ImaGAN, a novel architecture composed of two jointly trained CycleGANs, was proposed; it first extracts the edge information from the content image and then stylizes it. The system is end-to-end trainable and able to generate bi-directionally using shared parameters. We observed competitive style transfer results on the edges2handbags and edges2shoes datasets, and demonstrated the stability of the method through experiments.

Traditional edge detectors such as Sobel or Canny [20] can be utilized to generate elements in domain B. Applying the proposed method to other domains such as clothes, furniture, and faces is left as future work. We expect the proposed method to have practical applications that facilitate designing new styles of objects ranging from shoes and bags to furniture.

Acknowledgments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2017R1A2B2006165) and the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2018-0-00198, Object information extraction and real-to-virtual mapping based AR technology).

References

1. Azadi, S., Fisher, M., Kim, V.G., Wang, Z., Shechtman, E., Darrell, T.: Multi-content GAN for few-shot font style transfer. CoRR abs/1712.00516 (2017), http://arxiv.org/abs/1712.00516


2. Chang, H., Lu, J., Yu, F., Finkelstein, A.: PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)

3. Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: StyleBank: An explicit representation for neural image style transfer. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 2770–2779 (2017)

4. Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. CoRR abs/1612.04337 (2016), http://arxiv.org/abs/1612.04337

5. Deng, J., Berg, A., Satheesh, S., Su, H., Khosla, A., Fei-Fei, L.: ILSVRC-2012 (2012), URL http://www.image-net.org/challenges/LSVRC

6. Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. CoRR abs/1610.07629 (2016), http://arxiv.org/abs/1610.07629

7. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 2414–2423 (2016)

8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

9. Gupta, A., Johnson, J., Alahi, A., Fei-Fei, L.: Characterizing and improving stability in neural style transfer. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. pp. 4087–4096 (2017)

10. Huang, X., Belongie, S.J.: Arbitrary style transfer in real-time with adaptive instance normalization. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. pp. 1510–1519 (2017)

11. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 5967–5976 (2017)

12. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II. pp. 694–711 (2016)

13. Joo, D., Kim, D., Kim, J.: Generating a fusion image: One's identity and another's shape. CoRR abs/1804.07455 (2018), http://arxiv.org/abs/1804.07455

14. Kaur, P., Zhang, H., Dana, K.J.: Photo-realistic facial texture transfer. CoRR abs/1706.04306 (2017), http://arxiv.org/abs/1706.04306

15. Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1857–1865. PMLR, International Convention Centre, Sydney, Australia (06-11 Aug 2017), http://proceedings.mlr.press/v70/kim17a.html

16. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980

17. Li, C., Wand, M.: Combining markov random fields and convolutional neural networks for image synthesis. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 2479–2486 (2016)


18. Li, Y., Wang, N., Liu, J., Hou, X.: Demystifying neural style transfer. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. pp. 2230–2236 (2017)

19. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.: Universal style transfer via feature transforms. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. pp. 385–395 (2017), http://papers.nips.cc/paper/6642-universal-style-transfer-via-feature-transforms

20. Maini, R., Aggarwal, H.: Study and comparison of various image edge detection techniques

21. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS-W (2017)

22. Shen, F., Yan, S., Zeng, G.: Meta networks for neural style transfer. CoRR abs/1709.04111 (2017), http://arxiv.org/abs/1709.04111

23. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2818–2826 (2016)

24. Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: Feed-forward synthesis of textures and stylized images. In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. pp. 1349–1357 (2016), http://jmlr.org/proceedings/papers/v48/ulyanov16.html

25. Wilmot, P., Risser, E., Barnes, C.: Stable and controllable neural texture synthesis and style transfer using histogram losses. CoRR abs/1701.08893 (2017), http://arxiv.org/abs/1701.08893

26. Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., Hays, J.: TextureGAN: Controlling deep image synthesis with texture patches. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)

27. Yu, A., Grauman, K.: Fine-grained visual comparisons with local learning. In: Computer Vision and Pattern Recognition (CVPR) (June 2014)

28. Zhu, J.Y., Krahenbuhl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: Proceedings of European Conference on Computer Vision (ECCV) (2016)

29. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on (2017)