
IEEE TRANSACTIONS ON IMAGE PROCESSING i

Learning Symmetry Consistent Deep CNNs for Face Completion

Xiaoming Li*, Guosheng Hu*, Jieru Zhu, Wangmeng Zuo, Senior Member, IEEE, Meng Wang, Senior Member, IEEE, and Lei Zhang, Fellow, IEEE

Abstract—Deep convolutional networks (CNNs) have achieved great success in face completion, generating plausible facial structures. These methods, however, are limited in maintaining global consistency among face components and in recovering fine facial details. On the other hand, reflectional symmetry is a prominent property of face images and benefits face analysis and consistency modeling, yet it remains uninvestigated in deep face completion. In this work, we leverage two kinds of symmetry-enforcing modules to form a symmetry-consistent CNN model (i.e., SymmFCNet) for effective face completion. For missing pixels on only one of the half-faces, an illumination-reweighted warping subnet is developed to guide the warping and illumination reweighting of the other half-face. As for missing pixels on both half-faces, we present a generative reconstruction subnet together with a perceptual symmetry loss to enforce symmetry consistency of the recovered structures. SymmFCNet is constructed by stacking the generative reconstruction subnet upon the illumination-reweighted warping subnet, and can be learned in an end-to-end manner. Experiments show that SymmFCNet generates globally consistent results on images with synthetic and real occlusions, and performs favorably against the state-of-the-arts.

Index Terms—Face completion, reflectional symmetry, convolutional neural networks.

I. INTRODUCTION

THE task of face completion is to fill missing facial pixels with visually plausible hypotheses [7, 8]. The generated solutions for the missing parts aim at restoring semantic facial structures and realistic fine details, but are not required to exactly approximate the unique ground-truth. Unlike images of natural scenes, face images usually contain few repetitive structures [1], which further increases the difficulty of face completion.

* The first two authors contribute equally to this work.
This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61671182 and U19A2073.
X. Li is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong (E-mail: [email protected]).
G. Hu is with the Anyvision Group, Belfast BT3 9AD, U.K. (E-mail: [email protected]).
J. Zhu is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (E-mail: [email protected]).
W. Zuo is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Lab, Shenzhen, Guangdong 518066, China (E-mail: [email protected]).
M. Wang is with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China (E-mail: [email protected]).
L. Zhang is with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong (E-mail: [email protected]).
Manuscript received xxx; revised xxx; accepted xxx. Wangmeng Zuo is the corresponding author.

Moreover, face completion can be used in many real-world face-related applications such as unwanted content removal (e.g., glasses, scarves, and head-mounted displays in AR/VR), interactive face editing, and occluded face recognition.

Recently, along with the development of deep learning, significant progress has been made in image inpainting [1, 2, 9–13] and face completion [1, 3, 5–7, 14, 15]. Existing methods generally adopt the generative adversarial network (GAN) [16] framework, which involves a generator and a discriminator. On the one hand, contextual attention [1] and shift-connection [9, 10] have been introduced into the baseline generator (i.e., the context encoder [17]) to exploit surrounding repetitive structures for generating visually plausible content with fine details. On the other hand, global and local discriminators are incorporated to obtain visually pleasing results with locally realistic details [7, 14], and a semantic parsing loss [7] as well as structure information [4, 12] are also adopted to enhance the consistency of face completion.

However, face completion is not a simple application of image inpainting, and remains not well solved. Fig. 1 illustrates the results of state-of-the-art CNN-based methods, including Yeh et al. [8], GFCNet [7], DeepFill [1], DFNet [2], PConv [3], EdgeConnect [4], GMCNN [5], and PICNet [6]. One can see that most results are locally satisfying but globally inconsistent (e.g., the generated eyes or lips are different from the uncorrupted parts), making these results not very realistic.

Inspired by this observation, we present a deep symmetry-consistent face completion network (SymmFCNet), which leverages face symmetry to improve the global structure consistency and local fine details of the face completion result. The reflectional symmetry of face images, which has been widely adopted in face recognition and consistency modeling [18–20], remains a non-trivial issue in face completion due to the influence of illumination and pose. Non-symmetric lighting causes pixel values to differ from their corresponding ones in the other half-face. The deviation from a frontal face further breaks the reflectional symmetry and makes the pixel correspondence between the two half-faces much more complicated.

To address the aforementioned difficulty of establishing symmetrical correspondence, we first analyze this correspondence, which can be divided into three types. (1) The missing pixels in the input image correspond to non-occluded pixels in the flip image (red lines in Fig. 2 (a)), which indicates that these missing pixels can be filled by their symmetric ones. (2) The missing pixels in the input image correspond to missing pixels in the flip image (green lines in Fig. 2 (a)), which indicates that these missing pixels can only be filled by generation.

Fig. 1: Completion results on face images with synthetic and real occlusion. From left to right: Input, Yeh et al. [8], GFCNet [7], DeepFill [1], DFNet [2], PConv [3], EdgeConnect [4], GMCNN [5], PICNet [6], and Ours. For missing pixels in only one half-face (Row 1) and in both half-faces (Row 2), the results of the competing methods are locally plausible but globally inconsistent. On the contrary, with the incorporation of facial symmetry, our results are more realistic and globally consistent.

Fig. 2: Overview of our SymmFCNet. (a) Stage I: illumination-reweighted warping subnet. (b) Stage II: generative reconstruction subnet. Red, green and blue lines in (a) represent the three types of pixel-wise correspondence between the input and the flip image. Red: missing pixels (input) to non-occluded pixels (flip); Green: missing pixels (input) to missing pixels (flip); Blue: remaining pixels (input) to remaining pixels (flip). In Stage I, the missing pixels in one half-face (red lines) are filled by reweighting the illumination of their symmetrical pixels in the other half-face. In Stage II, the missing pixels in both half-faces (green lines in (a)) are filled by the generative reconstruction subnet with the incorporation of the perceptual symmetry loss for symmetry-consistent completion. Here, the confidence map C is shown as a heat map as in Fig. 9.

(3) The remaining pixels in the input image correspond to other pixels in the flip image (blue lines in Fig. 2 (a)), and should be preserved. Based on this analysis, we present two mechanisms to leverage symmetric consistency for filling the first two types of missing pixels.

For missing pixels on only one of the half-faces (see Fig. 2 (a), Stage I), it is natural to fill them by reweighting the illumination of the corresponding pixels in the other half-face (the red correspondence in Fig. 2 (a)). To cope with pose and illumination variations between half-faces, we suggest an illumination-reweighted warping subnet consisting of two parts: (i) a FlowNet to establish the correspondence map between the two half-faces, and (ii) a LightNet to indicate the illumination ratio between the two half-faces. For missing pixels on both half-faces (see Fig. 2 (b), Stage II), a perceptual symmetry loss is incorporated with a generative reconstruction subnet (RecNet) for symmetry-consistent completion. Based on the correspondence map established by FlowNet, the perceptual symmetry loss is defined on a decoder feature layer to alleviate the effect of illumination inconsistency between the two half-faces. To sum up, our full SymmFCNet is constructed by stacking the generative reconstruction subnet upon the illumination-reweighted warping subnet. Perceptual symmetry, reconstruction and adversarial losses are deployed on RecNet to train the full SymmFCNet end-to-end. Besides, an illumination consistency loss, a landmark loss and total variation (TV) regularization are employed in Stage I (Fig. 2 (a)) to improve the training stability of FlowNet and LightNet.

Experiments show that illumination-reweighted warping is effective in filling missing pixels on only one of the half-faces. In addition, RecNet can not only generate symmetry-consistent results for missing pixels on both half-faces, but also benefit the refinement of the result of illumination-reweighted warping. In terms of quantitative metrics and visual quality, our SymmFCNet performs favorably against the state-of-the-arts, and yields high-quality results on face images with real occlusions. More interestingly, we apply our SymmFCNet to two new tasks, i.e., facial identity and attribute recognition, and achieve promising results.

The main contributions of this paper include:

• To our knowledge, the first attempt to utilize reflectional symmetry for deep face completion, including an illumination-reweighted warping network for filling missing pixels on only one of the half-faces and a generative reconstruction network equipped with a perceptual symmetry loss for inpainting missing pixels on both half-faces.

• Favorable and globally consistent face completion performance of SymmFCNet on regular and irregular masks in comparison to the state-of-the-arts.

• Benefiting from facial symmetry, the first successful application of face completion to facial identity and attribute recognition, achieving promising improvements.

The remainder of this paper is organized as follows. Section II briefly reviews image completion and face symmetry related to this work. Section III first analyzes the three types of missing pixels and then provides the implementation details, including MaskNet for predicting the face region mask from an occluded face image, FlowNet for establishing the correspondence between the left and right half-faces, LightNet for illumination correction, and RecNet for generative reconstruction; finally, we give the objective function for learning our SymmFCNet. Section IV makes both qualitative and quantitative analyses on regular and irregular occlusions to compare with the state-of-the-arts, evaluates the effect of the main components of SymmFCNet, and discusses its limitations. Finally, Section V summarizes this work and discusses future work on the improvement of our SymmFCNet and its extension to other related tasks.

II. RELATED WORK

Face completion is to learn a mapping from an occluded input to a plausible completion result. Unlike general image inpainting, face images have their own characteristics, e.g., symmetric structure. Our work is motivated by applying face-specific properties to face completion. Thus, in this section, we briefly review the relevant works in three sub-fields: deep image inpainting, deep face completion, and the applications of symmetry in face analysis.

A. Deep Image Inpainting

Image inpainting aims to fill missing pixels in a seamless manner [21], and has wide applications such as restoration of damaged images and unwanted content removal. Recently, motivated by the unprecedented success of GANs in many vision tasks like style transfer, image-to-image translation, image super-resolution and face attribute manipulation, deep CNNs have also greatly facilitated the development of image inpainting. Originally, Pathak et al. [17] presented an encoder-decoder (i.e., context encoder) network to learn the image semantic structure for recovering the missing pixels, with an adversarial loss deployed to enhance the perceptual quality of the inpainting result. Subsequently, global and local discriminators [14] were adopted for better discrimination between real images and inpainting results. In addition, dilated convolution [5, 14] and partial convolution [3] were introduced to improve the training of the generator. Xie et al. [22] proposed learnable bidirectional attention maps for feature renormalization and effective inpainting of irregular holes. Structure prediction and foreground-awareness [4, 12] are incorporated to reconstruct reasonable structures. For harmonically blending the completion results into the existing content, Hong et al. [2] presented a smooth transition and fusion network. In [6], Cai et al. adopted variational autoencoders (VAEs) to generate diverse solutions for image completion. To exploit the repetitive structures in surrounding contexts, multi-scale neural patch synthesis (MNPS) [23] was suggested, and contextual attention [1] as well as shift-connection [9, 10] were further presented to overcome the inefficiency of MNPS. Unlike natural images, face images generally exhibit non-repetitive structures and are more sensitive to semantic consistency and visual artifacts, making it difficult to directly apply general-purpose inpainting models.

B. Deep Face Completion

Apart from the aforementioned image inpainting methods, Yeh et al. [8] developed a semantic face completion method, which exploits a trained GAN to find the closest encoding and then fills the missing pixels by considering both the context discriminator and the corrupted input image. Li et al. [7] presented a generative model to recover missing pixels by minimizing a combination of reconstruction loss, local and global adversarial losses, as well as a semantic parsing loss. Song et al. [15] proposed a geometry-aware face completion method by estimating facial landmark heatmaps from the unmasked region. He et al. [24] proposed a cross-spectral face completion model to fill the missing lighting content when synthesizing a visible image from a near-infrared one. For better recovery of facial details, Zhao et al. [25] used a guidance image from an extra non-occluded face image of the same identity to facilitate identity-aware completion. However, the introduction of a guidance image certainly limits its wide application, and its performance degrades remarkably when the guidance and occluded images are of different poses. Instead of selecting a guidance image in the completion process, we leverage the symmetry of face images to establish the correspondence between the two half-faces, which is then used to guide the generation of high-quality completion results.

C. Face Symmetry

Symmetry is closely related to human perception, understanding and discovery of images, and has received growing interest in computer vision [18–20, 26–28]. In computational symmetry, numerous methods have been proposed to detect reflection, rotation, translation and medial-axis-like symmetries from images [26, 27]. Reflectional symmetry is also an important characteristic of face images, and has been used to assist 3D face reconstruction [18], 3D face alignment [28] and face recognition [19, 20]. For identity-preserving face frontalization, Huang et al. [29] adopted a symmetry loss on the pixel and Laplacian spaces of the reconstruction results, which requires two conditions: 1) the face must be located at the image center, and 2) the roll and yaw of the facial pose must be zero, making it applicable only to the frontalization task. Unlike [29], we present a more general scheme for modeling face symmetry for the face completion task.


III. METHOD

Face completion aims at learning a mapping from an occluded face I^o, together with its binary indicator mask M^o, to a desired completion result Î (i.e., an estimate of the ground-truth I). Here, the images I^o, M^o and Î are of the same size h × w, and we use the unified convention that a mask value of 0 at (i, j) indicates a missing pixel. To exploit face symmetry, we present a two-stage SymmFCNet to generate symmetry-consistent completion results (see Fig. 2). In Stage I, an illumination-reweighted warping subnet is deployed to fill missing pixels on only one of the half-faces, where a FlowNet is used to establish the correspondence between the two half-faces and a LightNet is used to predict the illumination difference between them. In Stage II, a generative reconstruction subnet is used to handle missing pixels on both half-faces and to further refine the result of Stage I. Using the output of FlowNet, we define a perceptual symmetry loss on the decoder features to enforce symmetry-consistent completion. With the incorporation of face symmetry in both stages, our SymmFCNet can well constrain the completion results toward plausible and realistic structures. Stage I and Stage II are combined to handle different types of occlusions and can be learned in an end-to-end manner. In this section, we first detail the architecture of SymmFCNet and then define the learning objective.

A. Illumination-reweighted warping

Unlike general-purpose image inpainting, a face is a highly structured object with a prominent reflectional symmetry characteristic. Thus, when the missing pixels lie within only one half of the face, it is reasonable to fill them guided by the corresponding pixels in the other half-face. To this end, we need to handle the illumination inconsistency and establish the correspondence between pixels from the two half-faces. In the following, we introduce a MaskNet, a FlowNet and a LightNet for predicting the face region, computing the pixel correspondence, and illumination correction, respectively. To facilitate the understanding of these sub-networks, we summarize all the variables in Fig. 3 and Table I.

1) MaskNet: Since only the face region exhibits the symmetry characteristic, it is necessary to predict the face region mask M^{fr} from an occluded face, so that the subsequent FlowNet and LightNet can be applied to the face region only. To achieve this goal, we suggest a MaskNet which takes the occluded image I^o as input and predicts the face region mask M^{fr}:

$$M^{fr} = \mathcal{F}_m(I^o; \Theta_m), \quad (1)$$

where Θ_m denotes the MaskNet model parameters.

For training MaskNet, we first generate the ground-truth region mask M^{fr}_{gt} by computing the convex hull of the 68 facial landmarks detected by [30]. A mask value of 1 at (i, j) indicates that the pixel belongs to the face region, and 0 the background. The training objective of MaskNet is defined as:

$$\mathcal{L}_{fr} = \|M^{fr} - M^{fr}_{gt}\|^2. \quad (2)$$

An example of M^{fr} is shown in Fig. 3.
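As a concrete illustration of how the ground-truth face-region mask can be obtained from landmarks, the following sketch builds M^{fr}_{gt} as the filled convex hull of the 68 detected landmarks. The OpenCV-based realization is an assumption for illustration; the paper does not specify the exact tooling.

```python
import cv2
import numpy as np

def face_region_mask_gt(landmarks, height, width):
    """Build the ground-truth face-region mask M^{fr}_gt as the convex
    hull of 68 facial landmarks.

    landmarks: (68, 2) array of (x, y) pixel coordinates.
    Returns a (height, width) uint8 mask, 1 = face region, 0 = background.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))  # (K, 1, 2) hull points
    cv2.fillConvexPoly(mask, hull, 1)                   # rasterize the hull
    return mask
```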

2) FlowNet: One may establish the correspondence between pixels from the two half-faces by direct matching. However, such an approach is computationally expensive, and the annotation of dense correspondence is also practically infeasible. Instead, we introduce the flip image I^{o,f} (M^{o,f}) of the occluded face (mask) I^o (M^o), and adopt a FlowNet which takes both I^{o,f} and I^o to predict the flow field Φ = (Φ^x, Φ^y),

$$\Phi = \mathcal{F}_w(I^o, I^{o,f}; \Theta_w), \quad (3)$$

where Θ_w denotes the FlowNet model parameters. Given a pixel (i, j) in I^o, (Φ^x_{i,j}, Φ^y_{i,j}) indicates the position of its corresponding pixel in I^{o,f}. Note that I^{o,f} is the flip image of I^o. Thus, I^o(i, j) and I^{o,f}(Φ^x_{i,j}, Φ^y_{i,j}) are a pair of corresponding pixels from different half-faces (the lines shown in Fig. 2 (a)), and the correspondence between the two half-faces is thereby constructed.

With Φ, the pixel value at (i, j) of the warped image I^{o,f,w} is defined as the pixel value at (Φ^x_{i,j}, Φ^y_{i,j}) of the flipped image I^{o,f}. Since Φ^x_{i,j} and Φ^y_{i,j} are real numbers, I^{o,f}(Φ^x_{i,j}, Φ^y_{i,j}) can be bilinearly interpolated from its 4 surrounding neighboring pixels. Thus, the warped flip image I^{o,f,w}_{i,j} is computed as the interpolation result:

$$I^{o,f,w}_{i,j} = \sum_{(h,w)\in\mathcal{N}} I^{o,f}_{h,w}\,\max(0, 1-|\Phi^y_{i,j}-h|)\,\max(0, 1-|\Phi^x_{i,j}-w|), \quad (4)$$

where $\mathcal{N}$ denotes the 4-pixel neighborhood of (Φ^x_{i,j}, Φ^y_{i,j}). Analogously, we can also obtain the warped flip mask M^{o,f,w} of M^{o,f}:

$$M^{o,f,w}_{i,j} = \sum_{(h,w)\in\mathcal{N}} M^{o,f}_{h,w}\,\max(0, 1-|\Phi^y_{i,j}-h|)\,\max(0, 1-|\Phi^x_{i,j}-w|). \quad (5)$$

By defining the occlusion on one half-face as M^{o-h} = 1 − M^{o,f,w} ⊙ (1 − M^o) ⊙ M^{fr}, we can then identify the missing pixels (i, j) within only one half of the face as those with M^{o-h}_{i,j} = 0. Here, ⊙ denotes the element-wise product.
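The bilinear sampling in Eqns. 4 and 5 can be realized with standard grid sampling. Below is a minimal PyTorch sketch that warps the flipped image by the predicted flow field; normalizing the absolute coordinates to the [-1, 1] range expected by grid_sample is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img_flip, phi):
    """Bilinearly sample the flipped image I^{o,f} (or flipped mask) at
    the absolute positions given by the flow field Phi (Eqns. 4 and 5).

    img_flip: (N, C, H, W) flipped image
    phi:      (N, 2, H, W) absolute sampling coordinates (Phi^x, Phi^y)
    """
    n, _, h, w = img_flip.shape
    # grid_sample expects coordinates normalized to [-1, 1]
    gx = 2.0 * phi[:, 0] / (w - 1) - 1.0
    gy = 2.0 * phi[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)            # (N, H, W, 2), (x, y) order
    return F.grid_sample(img_flip, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```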

Since it is impractical to annotate the dense correspondence between the left and right half-faces, alternative losses are required to train FlowNet. The methods in [31–33] directly enforce losses on the warped images. For face completion, however, the ground-truth of the warped image is unknown, and the two half-faces may have different illuminations, making it unsuitable to use I^o as the ground-truth.

Following [34], we train FlowNet in a semi-supervised manner by incorporating a landmark loss and a TV regularizer, both defined on the flow field Φ. Given the ground-truth image I, we detect its 68 facial landmarks (x_k, y_k)|_{k=1}^{68} using [30]. Denote by I^f the horizontal flip of I. The landmarks of I^f, denoted by (x^f_k, y^f_k)|_{k=1}^{68}, can be obtained by horizontally flipping (x_k, y_k)|_{k=1}^{68}. In order to align I^{o,f} to the pose of I^o, it is natural to require (Φ^x_{x^f_k, y^f_k}, Φ^y_{x^f_k, y^f_k}) to be close to (x_k, y_k) (see the 2nd row in Fig. 3), and we thus define the landmark loss as:

$$\mathcal{L}_{lm} = \sum_{k=1}^{68}\left(\Phi^x_{x^f_k, y^f_k} - x_k\right)^2 + \left(\Phi^y_{x^f_k, y^f_k} - y_k\right)^2. \quad (6)$$

Fig. 3: Illustration of the variables in FlowNet and LightNet. Row 1: the occlusion masks generated by SymmFCNet. Row 2: the three types of correspondence between the two half-faces (red, green, and blue lines), and an illustration of using FlowNet to warp the flip image I^{o,f} into alignment with the input I^o.

TABLE I: Descriptions of the variables in our SymmFCNet.

M^{fr}: the face region mask indicating whether a pixel belongs to the face region or the background.
M^o: the binary occlusion mask.
M^{o,f}: the horizontal flip of M^o.
M^{o,f,w}: the warp of M^{o,f} via the bilinear interpolation in Eqn. 5.
M^{o-h}: the one-half occlusion mask indicating the missing pixels that can be filled by the corresponding ones in the other half-face (the red lines in Figs. 2 and 3).
M^{b-h}: the both-half occlusion mask indicating the missing pixels that should be generated in Stage II (the green lines in Figs. 2 and 3).
I^o: the occluded face image.
I^{o,f}: the horizontal flip of I^o.
I^{o,f,w}: the warp of I^{o,f} via the bilinear interpolation in Eqn. 4.
I^{s1}: the result of Stage I that completes the missing pixels in M^{o-h}.
C: the confidence map indicating the completion confidence of Stage I.

Furthermore, TV regularization is deployed to constrain the spatial smoothness of the flow field Φ. Given the 2-D dense flow field (Φ^x, Φ^y), the TV regularization is defined as:

$$\mathcal{L}_{TV} = \|\nabla_x\Phi^x\|^2 + \|\nabla_y\Phi^x\|^2 + \|\nabla_x\Phi^y\|^2 + \|\nabla_y\Phi^y\|^2, \quad (7)$$

where ∇_x and ∇_y denote the gradient operators along the x and y coordinates, respectively.
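A minimal PyTorch sketch of the two FlowNet training terms above, the landmark loss (Eqn. 6) and the TV regularizer (Eqn. 7); the tensor layouts are assumptions for illustration.

```python
import torch

def landmark_loss(phi, lm, lm_flip):
    """Landmark loss (Eqn. 6): the flow sampled at the flipped landmark
    positions should point back to the original landmark positions.

    phi:     (2, H, W) flow field, channel 0 = Phi^x, channel 1 = Phi^y
    lm:      (68, 2) landmarks (x_k, y_k) of the ground-truth image
    lm_flip: (68, 2) landmarks (x^f_k, y^f_k) of its horizontal flip
    """
    xf, yf = lm_flip[:, 0].long(), lm_flip[:, 1].long()
    pred_x = phi[0, yf, xf]        # Phi^x at (x^f_k, y^f_k); rows index y
    pred_y = phi[1, yf, xf]
    return ((pred_x - lm[:, 0]) ** 2 + (pred_y - lm[:, 1]) ** 2).sum()

def tv_loss(phi):
    """TV regularizer (Eqn. 7) on the 2-D dense flow field."""
    dx = phi[:, :, 1:] - phi[:, :, :-1]    # gradients along x
    dy = phi[:, 1:, :] - phi[:, :-1, :]    # gradients along y
    return (dx ** 2).sum() + (dy ** 2).sum()
```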

3) LightNet: Generally, the left and right half-faces have inconsistent lighting; therefore, we cannot fill missing pixels directly with I^{o,f,w}. To compensate for the illumination variation, we add a light adjustment module (LightNet) to make the completion result more harmonious. LightNet takes I^o and I^{o,f} as inputs and predicts the illumination ratio R as follows:

$$R = \mathcal{F}_l(I^o, I^{o,f}; \Theta_l), \quad (8)$$

where Θ_l denotes the LightNet model parameters and R has three channels (an RGB map) for each input image. I^{o,f,w} ⊙ R is expected to have the same illumination as I^o. The completion result of Stage I is then obtained by:

$$I^{s1} = I^{o,f,w} \odot R \odot (1 - M^{o\text{-}h}) + I^o \odot M^{o\text{-}h}. \quad (9)$$

For training LightNet, we introduce an illumination consistency loss. Denote by I^{f,w} the warped version of the flipped ground-truth I^f. The illumination-reweighted I^{f,w} is required to approximate the original ground-truth I, and we thus define the illumination consistency loss as:

$$\mathcal{L}_l = \|(I^{f,w} \odot R - I) \odot M^{fr}\|^2, \quad (10)$$

where we exclude the effect of the irrelevant background through the predicted face region mask M^{fr}.
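The Stage I composition (Eqn. 9) and the illumination consistency loss (Eqn. 10) reduce to a few element-wise operations; a minimal sketch, assuming all tensors share the shape (N, 3, H, W) and that masks are broadcastable:

```python
import torch

def stage1_result(i_o, i_ofw, ratio, m_oh):
    """Eqn. 9: fill one-half-face holes (M^{o-h} = 0) with the
    illumination-reweighted warped flip image, keep the rest of I^o."""
    return i_ofw * ratio * (1.0 - m_oh) + i_o * m_oh

def illumination_loss(i_fw, ratio, i_gt, m_fr):
    """Eqn. 10: the reweighted warp of the flipped ground-truth should
    match the ground-truth inside the face region (squared error; the
    paper's norm may be summed rather than averaged)."""
    return (((i_fw * ratio - i_gt) * m_fr) ** 2).mean()
```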

B. Generative reconstruction

Since illumination-reweighted warping can only handle missing pixels within one half of the face, we further present a generative reconstruction subnet for inpainting missing pixels on both half-faces. It is worth noting that when a face has a large pose (e.g., the 1st row in Fig. 9), fewer valid correspondences from symmetry can be utilized to help inpainting, which reduces the reliability of I^{s1}. Thus, directly taking M^{b-h} as the input of Stage II would confuse the learning of RecNet. Denote the value in Φ by (x + ∆x, y + ∆y), in which (x, y) represents the pixel index and (∆x, ∆y) is the motion vector. To alleviate the above issue, we use the motion vector (∆x, ∆y) of Φ, which reflects the pose condition, to generate a confidence map with values in [0, 1]. When the face image is not frontal, the motion vector in Φ is larger, resulting in a smaller value in the confidence map. We take the motion vector (∆x, ∆y) as input and pass it through three convolutional layers to generate the confidence map C (see Figs. 2 and 9), which indicates the completion confidence of Stage I and serves as the mask condition for Stage II. The confidence map is formulated as

$$C = \mathcal{F}_c(\Delta x, \Delta y; \Theta_c), \quad (11)$$

where Θ_c denotes the parameters of the three convolutional layers, and Φ is the flow field predicted by FlowNet.
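A minimal sketch of the three-layer confidence branch (Eqn. 11); the channel widths, kernel sizes and the sigmoid used to squash the output into [0, 1] are assumptions, since the paper only states that three convolutional layers are used.

```python
import torch.nn as nn

class ConfidenceNet(nn.Module):
    """Map the motion vector (dx, dy) of the flow field to a
    confidence map C in [0, 1] (Eqn. 11)."""
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, motion):              # motion: (N, 2, H, W)
        return self.net(motion)
```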

Denote the occlusion on both half-faces by M^{b-h} = 1 − (1 − M^o) ⊙ M^{o-h} ⊙ M^{fr}, and the background missing mask by M^{bg} = (1 − M^{fr}) ⊙ (1 − M^o). The mask condition of Stage II is then formulated as M^{s2} = ((1 − M^{o-h}) ⊙ C + M^{b-h}) ⊙ (1 − M^{bg}) (Fig. 2 (b)). The generative reconstruction subnet (RecNet) takes I^{s1} (the completion result of Stage I) and M^{s2} as inputs to generate the final completion result:

$$\hat{I} = \mathcal{F}_r(I^{s1}, M^{s2}; \Theta_r), \quad (12)$$

where Θ_r represents the RecNet model parameters. The image inpainting task is sensitive to the receptive field, especially for irregular masks. Thus, for RecNet, we adopt a ResBlock architecture [36] with multi-dilated convolution [35] to enlarge the receptive field, and use spectral normalization [37] in each convolution layer. Please refer to Appendix A for more details.
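The mask algebra above can be transcribed directly into element-wise tensor operations; a small sketch under the assumption that all masks and the confidence map are broadcastable tensors:

```python
def stage2_mask(m_o, m_oh, m_fr, conf):
    """Compose the Stage-II mask condition M^{s2} from the occlusion
    mask M^o, the one-half mask M^{o-h}, the face-region mask M^{fr}
    and the Stage-I confidence map C, following the definitions above."""
    m_bh = 1.0 - (1.0 - m_o) * m_oh * m_fr          # both-half occlusion mask
    m_bg = (1.0 - m_fr) * (1.0 - m_o)               # background missing mask
    return ((1.0 - m_oh) * conf + m_bh) * (1.0 - m_bg)
```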

The flow field Φ is further utilized to enforce symmetry consistency on the completion results for missing pixels on both half-faces. RecNet also takes the flipped versions of I^{s1} and M^{s2} as inputs to generate Î^f. We define Ω_l as the (L − l)-th decoder feature map for (I^{s1}, M^{s2}), and Ω^f_l as that for the flipped inputs (I^{s1,f}, M^{s2,f}), where L is the network depth. By downsampling Φ, M^{s2} and M^{fr} to Φ_↓, M^{s2}_↓ and M^{fr}_↓, which have the same size as Ω_l, the perceptual symmetry loss can then be defined as:

$$\mathcal{L}_s = \frac{1}{C_l}\sum_{i,j}\left(\left(\Omega_l(i,j) - \Omega^f_l(\Phi^x_{\downarrow,i,j}, \Phi^y_{\downarrow,i,j})\right) M^{fr}_{\downarrow}(i,j)\left(1 - M^{s2}_{\downarrow}(i,j)\right)\right)^2, \quad (13)$$

where C_l denotes the channel number of Ω_l. In our implementation, we set l = 1 with feature size 128 × 128. Benefiting from L_s, we can maintain symmetric consistency even when filling missing pixels on both half-faces.
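The loss in Eqn. 13 amounts to sampling the flipped decoder feature at the symmetric positions given by the (downsampled) flow and penalizing the masked difference. The following PyTorch sketch illustrates one way to realize it; rescaling the absolute flow coordinates when downsampling is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def perceptual_symmetry_loss(feat, feat_flip, phi, m_fr, m_s2):
    """Sketch of Eqn. 13.

    feat, feat_flip: (N, C, h, w) decoder features Omega_l and Omega^f_l
    phi:             (N, 2, H, W) full-resolution flow (absolute coords)
    m_fr, m_s2:      (N, 1, H, W) face-region and Stage-II masks
    """
    n, c, h, w = feat.shape
    big_h, big_w = phi.shape[2], phi.shape[3]
    # downsample the flow to the feature resolution and rescale its
    # absolute coordinates to the smaller grid
    phi_d = F.interpolate(phi, size=(h, w), mode='bilinear', align_corners=True)
    gx = 2.0 * (phi_d[:, 0] * (w / big_w)) / (w - 1) - 1.0
    gy = 2.0 * (phi_d[:, 1] * (h / big_h)) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    feat_flip_w = F.grid_sample(feat_flip, grid, mode='bilinear',
                                align_corners=True)
    m_fr_d = F.interpolate(m_fr, size=(h, w), mode='nearest')
    m_s2_d = F.interpolate(m_s2, size=(h, w), mode='nearest')
    diff = (feat - feat_flip_w) * m_fr_d * (1.0 - m_s2_d)
    return (diff ** 2).sum() / c
```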

A reconstruction loss is introduced to require the final completion result Î to be close to the ground-truth I, and it involves two terms. The first one, the ℓ_2 loss, is defined as the squared Euclidean distance between Î and I:

$$\ell_2 = \|\hat{I} - I\|^2. \quad (14)$$

Inspired by [38], the second term adopts a perceptual loss defined on the pre-trained VGG-Face model [39]. Denote by Ψ the VGG-Face model and by Ψ_k its k-th layer of feature maps (we use k = 5). The perceptual loss is then defined as

$$\ell_{perceptual} = \frac{1}{C_k H_k W_k}\|\Psi_k(\hat{I}) - \Psi_k(I)\|^2, \quad (15)$$

where C_k, H_k and W_k denote the channel number, height and width of the feature maps, respectively. Clearly, this perceptual loss is also beneficial to identity preservation. The reconstruction loss is then defined as

$$\mathcal{L}_r = \lambda_{r,2}\,\ell_2 + \lambda_{r,p}\,\ell_{perceptual}, \quad (16)$$

where λ_{r,2} and λ_{r,p} are the tradeoff parameters balancing ℓ_2 and ℓ_{perceptual}. This objective constrains the generated result Î to be close to the ground-truth I in both pixel and feature space.
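A compact sketch of the reconstruction loss (Eqns. 14-16). The paper uses a pre-trained VGG-Face network as the feature extractor Ψ; the torchvision ImageNet VGG-16 below is only a stand-in, and the chosen feature depth and mean-based normalization are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ReconstructionLoss(nn.Module):
    """lam_l2 * ||I_hat - I||^2 + lam_p * ||Psi_k(I_hat) - Psi_k(I)||^2."""
    def __init__(self, lam_l2=300.0, lam_p=3.0, feat_depth=16):
        super().__init__()
        # Stand-in feature extractor; the paper uses VGG-Face instead.
        # (On older torchvision, use vgg16(pretrained=True).)
        self.features = vgg16(weights="IMAGENET1K_V1").features[:feat_depth].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.lam_l2, self.lam_p = lam_l2, lam_p

    def forward(self, pred, target):
        l2 = ((pred - target) ** 2).mean()
        perc = ((self.features(pred) - self.features(target)) ** 2).mean()
        return self.lam_l2 * l2 + self.lam_p * perc
```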

Finally, an adversarial loss is deployed to generate photo-realistic results. In [7, 14], global and local discriminators are exploited, where the local discriminator is defined on the inpainting result of a hole. Considering that the hole may be irregular, it is inconvenient to define and learn such a local discriminator. Instead, we apply local discriminators to four specific facial parts, i.e., the left/right eye, nose and mouth. Thus, the local discriminators are consistent for any image with any missing mask, which facilitates the learning of SymmFCNet. For each part, we define its local adversarial loss as

$$\ell_{a,p_i} = \min_{\Theta}\max_{D_{p_i}} \mathbb{E}_{I_{p_i}\sim p_{data}(I_{p_i})}\left[\log D_{p_i}(I_{p_i})\right] + \mathbb{E}_{\hat{I}_{p_i}\sim p_{rec}(\hat{I}_{p_i})}\left[\log\left(1 - D_{p_i}(\hat{I}_{p_i})\right)\right], \quad (17)$$

where p_{data}(I_{p_i}) and p_{rec}(Î_{p_i}) stand for the distributions of the i-th part from I and Î, respectively. D_{p_i} denotes the i-th part discriminator, and we adopt the SAGAN discriminator [40] for each one. To sum up, the overall adversarial loss is defined as

$$\mathcal{L}_a = \lambda_{a,g}\,\ell_{a,g} + \sum_i \lambda_{a,p_i}\,\ell_{a,p_i}, \quad (18)$$

where ℓ_{a,g} represents the global adversarial loss working on the whole image, and λ_{a,g} and λ_{a,p_i} are the tradeoff parameters for the global and local adversarial losses, respectively. For each part, we crop the square region based on the convex hull of its landmarks and employ bilinear interpolation to resize it to 128 × 128.

C. Learning Objective

Taking all the losses on MaskNet, FlowNet, LightNet and RecNet into account, the overall objective of SymmFCNet is defined as

$$\mathcal{L} = \lambda_{fr}\mathcal{L}_{fr} + \lambda_{lm}\mathcal{L}_{lm} + \lambda_{TV}\mathcal{L}_{TV} + \lambda_{l}\mathcal{L}_{l} + \lambda_{s}\mathcal{L}_{s} + \mathcal{L}_r + \mathcal{L}_a, \quad (19)$$

where λ_{fr}, λ_{lm}, λ_{TV}, λ_l and λ_s are the tradeoff parameters for the face region mask loss, landmark loss, TV regularization, illumination consistency loss and symmetry consistency loss, respectively. Note that our SymmFCNet is constructed by stacking the generative reconstruction subnet upon the illumination-reweighted warping subnet and can be trained in an end-to-end manner. Thus, MaskNet, FlowNet and LightNet can also be learned by minimizing L_r, L_a and L_s, even though these losses are defined on RecNet.

IV. EXPERIMENTS

TABLE II: Quantitative comparisons (PSNR, SSIM, LPIPS) on two types of masks (irregular and center) and the voting results of the user study. Here, ↑ (↓) indicates higher (lower) is better.

Methods | Image Size | Irregular Mask PSNR↑ / SSIM↑ / LPIPS↓ | Center Mask PSNR↑ / SSIM↑ / LPIPS↓ | User Study Random Mask↑ / Real Image↑
Yeh et al. [8] | 64×64 | 18.37 / .743 / .324 | 18.61 / .773 / .305 | - / -
GFCNet [7] | 128×128 | 21.50 / .840 / .232 | 22.61 / .926 / .128 | 1.04% / 0.96%
SymmFCNet128 | 128×128 | 27.52 / .961 / .061 | 27.04 / .959 / .055 | - / -
DeepFill [1] | 256×256 | 23.14 / .882 / .136 | 24.08 / .945 / .074 | 2.96% / 3.60%
DFNet [2] | 256×256 | 25.19 / .912 / .096 | 22.69 / .942 / .077 | 6.08% / 5.68%
PConv [3] | 512×512 | - / - / - | - / - / - | 17.92% / 12.32%
EdgeConnect [4] | 256×256 | 25.68 / .921 / .090 | 23.71 / .948 / .076 | 15.36% / 17.84%
GMCNN [5] | 256×256 | 24.03 / .884 / .141 | 25.30 / .952 / .059 | 12.04% / 13.92%
PICNet [6] | 256×256 | 24.90 / .907 / .106 | 24.29 / .950 / .065 | 19.76% / 20.08%
SymmFCNet | 256×256 | 26.63 / .942 / .076 | 25.32 / .953 / .056 | 24.84% / 25.60%

Experiments are conducted to assess our SymmFCNet and compare it with state-of-the-art image inpainting and face completion methods, including Yeh et al. [8], GFCNet [7], DeepFill [1], DFNet [2], PConv [3], EdgeConnect [4], GMCNN [5], and PICNet [6]. For comprehensive evaluation, quantitative and qualitative results as well as a user study are reported. In addition, we test the completion performance both on images with synthetic missing pixels and on images with real occlusions. We also conduct an ablation study to assess the contributions of the main components of SymmFCNet, including FlowNet, LightNet, and the perceptual symmetry loss. Moreover, we find that face symmetry is beneficial to identity and attribute preservation, and thus conduct identity and attribute recognition experiments to compare with the competing methods. Finally, we show the completion results of our SymmFCNet on images with different poses and analyze the limitations of SymmFCNet. The source code and models are available at: https://github.com/csxmli2016/SymmFCNet.

A. Dataset and Setting

We evaluate our proposed model on the CelebA-HQ dataset [41, 42] and use the training, test, and validation splits from [6]. We also adopt random cropping and chromatic transformations (brightness, contrast) [43] for data augmentation. The hyper-parameters of SymmFCNet are set as follows: λ_{fr}=5, λ_{lm}=10, λ_{TV}=1, λ_l=10, λ_s=100, λ_{r,2}=300, λ_{r,p}=3, λ_{a,g}=1, λ_{a,p_1}=λ_{a,p_2}=λ_{a,p_3}=λ_{a,p_4}=2. The training of SymmFCNet includes three stages. (i) We first pre-train MaskNet, FlowNet and LightNet for 10 epochs. (ii) With MaskNet, FlowNet and LightNet fixed, we pre-train RecNet for 20 epochs. (iii) Finally, the full SymmFCNet is trained end-to-end by minimizing the learning objective L. All models are optimized using the ADAM algorithm [44] with a learning rate of 2 × 10^{-4} and β_1 = 0.5. The learning rate is reduced by a factor of 0.1 when the reconstruction loss on the validation set stops decreasing. The batch size is 4 and the training is stopped after 200 epochs.
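For reference, the optimization setting above can be expressed as a short PyTorch sketch; using ReduceLROnPlateau to realize the "reduce by 0.1 when the validation reconstruction loss stops decreasing" schedule is an assumption.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module):
    """ADAM with lr = 2e-4 and beta1 = 0.5, plus a plateau-based
    learning-rate decay by a factor of 0.1."""
    opt = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.1)
    return opt, sched   # call sched.step(val_rec_loss) once per epoch
```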

B. Results on Images with Synthetic Missing Pixels

Quantitative and qualitative results are reported for our SymmFCNet and seven state-of-the-art methods using their released models, which are all trained on the same datasets. Among them, Yeh et al. [8] and GFCNet [7] are designed for face completion, but they can only handle 64 × 64 and 128 × 128 images, respectively. PConv [3] requires online manual specification of the missing masks, and we thus do not report its quantitative metrics, since this would require manually editing occlusion masks for more than 2,000 test images. PICNet [6] can generate 50 samples, and we only select the one with the highest score as its result. We randomly select irregular occlusion masks from PConv [3] (with an average Stage I pre-filling percentage of 21.45%) and a center mask (with a Stage I pre-filling percentage of nearly 0.0%) to generate our test data.

1) Quantitative Results: Table II lists the PSNR, SSIM, and perceptual similarity (LPIPS) [45] on the test data with the PConv (irregular) mask and the center mask. For Yeh et al. [8] and GFCNet [7], we simply downsample the ground-truth to their image sizes (i.e., 64 × 64 and 128 × 128, respectively) for computing these metrics. For a fair comparison, we also retrain a SymmFCNet128 to process 128 × 128 images, which is clearly better than GFCNet [7]. We can further make the following observations. (i) For the irregular (PConv) mask, our method clearly outperforms the other competing methods, e.g., by about 1 dB in PSNR, which can mainly be attributed to the pre-filling in Stage I and the perceptual symmetry loss in Stage II. (ii) For the center mask, which occludes both half-faces, our method still achieves favorable performance against the other methods, mainly indicating the effectiveness of the perceptual symmetry loss. Our method also achieves the best LPIPS, which is more consistent with human perception.

2) Qualitative Results: The solutions for face completion are neither unique nor required to exactly approximate the ground-truth. Thus, a qualitative comparison is conducted to show the effectiveness of our method. Here, we only report the recent state-of-the-art works, excluding Yeh et al. [8] and GFCNet [7]. Fig. 4 shows the completion results on irregular and center masks. Benefiting from the incorporation of illumination-reweighted warping and the perceptual symmetry loss, our SymmFCNet achieves very promising inpainting results that preserve visually pleasing, symmetry-consistent details for missing pixels within only one half-face, while the other methods are limited in maintaining global symmetry consistency. Obviously, the completion results of DeepFill [1], DFNet [2], PConv [3], EdgeConnect [4], GMCNN [5] and PICNet [6] differ from the non-occluded face region, e.g., in the colors, shapes and locations of the right eyes in the 1st and 2nd rows. As for the center mask, our method generates results comparable to GMCNN and PICNet, but with better global consistency.

Fig. 4: Completion results on irregular and center masks. For the one-half occlusions (Rows 1 and 2), the results generated by SymmFCNet are more consistent with the non-occluded half (e.g., eye location, shape and color). Our SymmFCNet also performs favorably against the competing methods on the both-half occlusions (Rows 3-6).

3) User Study: We conduct a user study to evaluate the performance of our SymmFCNet under different occlusions. Specifically, it covers synthetic masks with random occlusions and real-world occlusions, with 50 images randomly selected from the synthetic test data and 25 images with real occlusions, respectively. We display the results of the different methods, in random order and without a time limit, to 50 workers who are asked to choose the one with the best global consistency and perceptual quality. We use the percentage of votes for a particular algorithm against all votes to evaluate its performance. The results in Table II show that SymmFCNet is selected as the best with a probability over 5% higher than the second-ranked method, indicating the superiority of SymmFCNet.

C. Results on Images with Real Occlusions

By manually specifying the missing masks, Fig. 5 shows the completion results on four face images with real occlusions. For the 1st and 2nd images, the occlusions lie mainly within one half-face, and our result is globally more symmetry-consistent in comparison to the other methods. Moreover, one can see that even though there is an unpleasant structure in the non-occluded region (the left face in the 2nd row), our RecNet can further refine the completion result that was filled from the symmetric region in Stage I. As for the 3rd and 4th images, even though the occlusions are nearly symmetric, SymmFCNet still performs favorably, validating the effectiveness of the perceptual symmetry loss.

Fig. 5: Completion results on images with real occlusion. Rows 1-2 and Rows 3-4 show one-half and both-half occlusions, respectively.

D. Ablation Study

Two groups of experiments are conducted to assess the contributions of the main components in our SymmFCNet. In the first one, Fig. 6 shows the intermediate results of SymmFCNet, including the warped images and the completion results after FlowNet, LightNet, and RecNet. From Fig. 6, we have the following observations: (i) From the warped flip image I^{o,f,w}, FlowNet can correctly align the flip image with the original one and construct the correspondence between the left and right half-faces. (ii) From the completion results I^{s1} with and without the illumination correction ratio R, one can see that although the correspondence can be used to fill missing pixels within only one half-face, the result suffers from illumination inconsistency, which can be improved by introducing LightNet. (iii) From the final result Î, we observe that RecNet not only fills missing pixels on both half-faces, but is also effective in further refining the result of illumination-reweighted warping.

Fig. 6: Intermediate results of SymmFCNet. From left to right: occluded face image I^o; warped flip image I^{o,f,w} obtained by warping the flip image I^{o,f} with the predicted flow field Φ (Eqn. 4); I^{s1} w/o R, the result of Stage I without illumination correction; I^{s1} w/ R, the final result of Stage I with illumination correction; and the final result Î of SymmFCNet.

TABLE III: Quantitative results for different components of SymmFCNet. ↑ (↓) indicates higher (lower) is better.

Metrics | Plain RecNet | SymmFCNet (-GL0) | SymmFCNet (-L) | SymmFCNet (-S) | SymmFCNet (Full)
PSNR↑ | 24.84 | 25.21 | 26.42 | 26.35 | 26.63
SSIM↑ | 0.898 | 0.903 | 0.933 | 0.927 | 0.942
LPIPS↓ | 0.091 | 0.088 | 0.079 | 0.081 | 0.076

Fig. 7: Visual comparisons of the SymmFCNet components. From left to right: Input; Plain RecNet (removing FlowNet, LightNet and the perceptual symmetry loss); SymmFCNet (-GL0) (removing LightNet and applying the predicted flow field only in the perceptual symmetry loss); SymmFCNet (-L) (removing LightNet); SymmFCNet (-S) (removing the perceptual symmetry loss); SymmFCNet (Full); and Ground-truth.

In the second group of experiments, we further assess the effects of the perceptual symmetry loss, FlowNet, and LightNet. To this end, we consider five variants of SymmFCNet: (i) SymmFCNet (Full); (ii) SymmFCNet (-S): removing the perceptual symmetry loss; (iii) SymmFCNet (-L): removing LightNet; (iv) SymmFCNet (-GL0): removing LightNet and applying the predicted flow field only in the perceptual symmetry loss; and (v) plain RecNet: removing FlowNet, LightNet and the perceptual symmetry loss. Table III and Fig. 7 report the quantitative and qualitative results of these variants. Plain RecNet only performs on par with GMCNN and PICNet (Table II) and is prone to symmetry-inconsistent completion results (Fig. 7). In the following, we further analyze FlowNet, the perceptual symmetry loss, and LightNet.

1) FlowNet: The flow field from FlowNet is exploited for (i) guiding the completion of missing pixels within only one half-face and (ii) training RecNet in combination with L_s. Here we focus on (i) and compare SymmFCNet (-L) and SymmFCNet (-GL0). By using FlowNet to complete missing pixels within only one half-face, SymmFCNet (-L) attains notable gains in PSNR and SSIM (see Table III). From Fig. 7, SymmFCNet (-GL0) is still limited in preserving symmetry consistency, while this is well addressed by SymmFCNet (-L).

2) Perceptual symmetry loss: The contribution of the perceptual symmetry loss can be assessed by comparing SymmFCNet (-GL0) with plain RecNet and SymmFCNet (Full) with SymmFCNet (-S). In comparison to plain RecNet, SymmFCNet (-GL0) achieves moderate gains in the quantitative metrics (see Table III) and more symmetry-consistent results (see Fig. 7). It is worth noting that, compared with SymmFCNet (-S), notable gains are obtained by SymmFCNet (Full). From Fig. 7, SymmFCNet (Full) is also able to correct the artifacts and illumination inconsistency produced in the first stage. Thus, RecNet with the perceptual symmetry loss is helpful both in filling missing pixels on both half-faces and in refining the result of Stage I.

3) LightNet: We further compare SymmFCNet (Full) withSymmFCNet (-L) to assess the contribution of LightNet. It can

Page 11: IEEE TRANSACTIONS ON IMAGE PROCESSING i Learning …cslzhang/paper/SymmFCNet_VR_… · face images generally exhibit non-repetitive structures and are more sensitive to semantic consistency

IEEE TRANSACTIONS ON IMAGE PROCESSING xi

TABLE IV: Comparisons on identity and attribute recognition accuracy. One can observe that compared with other competingmethod, our SymmFCNet can well preserve the identity and facial attributes in irregular occlusion.

Attribute Accuracy (%)↑Methods RecognitionAccuracy (%)↑ Attractive Beard Male Makeup Smiling Young

GFCNet [7] 77.3 75.4 88.7 90.2 79.1 83.2 80.3DeepFill [1] 78.4 80.6 89.4 90.6 81.6 84.5 79.5DFNet [2] 83.2 89.2 92.6 94.1 92.3 86.4 91.5

EdgeConnect [4] 86.8 92.3 93.2 94.2 91.0 88.6 92.1GMCNN [5] 81.3 72.0 90.8 92.5 79.4 79.4 83.0PICNet [6] 85.6 91.2 94.2 96.1 93.9 89.1 92.7

SymmFCNet 91.2 93.4 96.2 97.0 95.2 90.2 94.9

Input DeepFill DFNet PConv EdgeConnect GMCNN PICNet Ours Ground-truthInput DeepFill DFNet PConv EdgeConnect GMCNN PICNet Ours Ground-truth

Input Ours Ground-truth

Fig. 8: Examples with top improvements in face recognition.

It can be seen that the introduction of LightNet further improves the quantitative performance (see Table III) and generates illumination-consistent results (see Fig. 7). We also note that RecNet helps correct illumination inconsistency, and SymmFCNet (-L) attains the second-best quantitative performance among the five SymmFCNet variants. Even so, from the top image in Fig. 7, illumination inconsistency remains obvious in the result of SymmFCNet (-L), indicating that LightNet is still required and cannot be entirely replaced by RecNet.
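
As a rough illustration of the illumination-reweighted warping in Stage I, the sketch below multiplies the warped flip image by a per-pixel ratio before pre-filling the one-half missing region. The function and argument names are ours, and the exact form of the correction in the paper may differ.

# Hedged sketch: illumination-reweighted pre-filling of missing pixels whose
# mirror counterparts are visible. "ratio" stands for a per-pixel correction
# predicted by LightNet; all names here are illustrative.
import torch

def illumination_reweighted_prefill(img, mask_one_half, flow, ratio):
    """img: (N,3,H,W) corrupted input; mask_one_half: (N,1,H,W), 1 = missing
    pixel whose mirror is visible; flow: (N,2,H,W); ratio: (N,3,H,W)."""
    flipped = torch.flip(img, dims=[3])
    warped = warp_with_flow(flipped, flow)   # align the flip with the input
    corrected = ratio * warped               # compensate the lighting gap
    return mask_one_half * corrected + (1 - mask_one_half) * img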

E. Facial Identity and Attribute Recognition

We have evaluated the performance of SymmFCNet in the above subsections. Face completion visually improves the quality of face images by filling in the missing pixels. Moreover, face occlusion is one of the main challenges in many face analysis tasks, such as facial identity and attribute recognition. It is natural to apply the completed faces to these tasks, which, however, has rarely been explored. Thus, in this section, we evaluate the performance of facial identity and attribute recognition using all our testing data with the irregular masks of Section IV-B (average pre-filling percentage of 21.45% in Stage I).

1) Face recognition: Face recognition faces many challenges such as pose, illumination, and occlusion. Facial symmetry is also beneficial to preserving identity and attributes, and has been adopted in face recognition [19, 20]. Here, we explore the effectiveness of symmetry-based face completion for the face occlusion problem. We use the OpenFace toolbox [46] to measure whether the completion result and the ground-truth share the same identity, thereby assessing the coherence of the completion result with its surrounding context. From Table IV, it can be seen that SymmFCNet exhibits better face recognition performance by a large margin, i.e., at least 4.4% higher than the others. Among these test images, we analyze the improvements in identity recognition by computing the feature distances between (input, ground-truth) and (ours, ground-truth) pairs. We observe that the top improvements occur for faces with nearly one-half occlusion, which can be filled from their symmetric pixels. Three examples are shown in Fig. 8. The average distances for (input, ground-truth) and (ours, ground-truth) are 1.27 and 0.09, respectively (the distance threshold for recognition in the OpenFace toolbox is 0.99; see footnote 1), a large improvement that can mainly be attributed to the symmetry-based completion of our SymmFCNet.
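
The identity check behind Table IV can be reproduced along the following lines: embeddings of the completed face and the ground-truth are compared by squared L2 distance against the 0.99 threshold mentioned above. The embedding extraction itself (OpenFace) is left abstract here, so the helper names are illustrative.

# Sketch of the identity check; a generic embedding function stands in for
# the OpenFace pipeline, whose exact API is omitted here.
import numpy as np

def same_identity(emb_a: np.ndarray, emb_b: np.ndarray, thresh: float = 0.99) -> bool:
    """OpenFace-style check: squared L2 distance between unit-norm 128-d
    embeddings; pairs below the threshold count as the same identity."""
    d = float(np.sum((emb_a - emb_b) ** 2))
    return d < thresh

def recognition_accuracy(pairs):
    """Fraction of (completed, ground-truth) embedding pairs judged the same person."""
    return float(np.mean([same_identity(a, b) for a, b in pairs]))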

2) Attribute recognition: Attribute recognition is another important application that also suffers from the occlusion problem. It is interesting to explore whether face completion can effectively preserve facial attributes and whether the completed faces can improve attribute recognition performance. To evaluate the attribute recognition accuracy on completed faces, we trained a classifier with ResNet [36] on CelebA [41], from which we select six face-related attributes (i.e., attractive, beard, male, makeup, smiling, and young). Before completion, the recognition accuracies of the six attributes on the occluded images are 51.2%, 70.8%, 54.6%, 49.5%, 50.7%, and 68.7%, respectively, which are close to random guessing between Yes and No. After completion with our proposed SymmFCNet, the improvements for these attributes are 42.2%, 25.4%, 42.4%, 45.7%, 39.5%, and 26.2%, respectively. Among these six attributes, Attractive, Male, and Makeup depend on the global face region and are improved the most (i.e., by at least 40%), which benefits from our globally consistent results. It can also be seen from Table IV (right) that our SymmFCNet achieves the best performance (at least 1% higher than the second-best method), indicating that our SymmFCNet also benefits facial attribute recognition.
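
A minimal sketch of such an attribute classifier is given below: a ResNet backbone with six sigmoid outputs, one per attribute. The specific backbone depth, loss, and training schedule are assumptions, since the text only states that a ResNet classifier was trained on CelebA.

# Illustrative setup (assumptions): ResNet backbone with six binary heads.
import torch
import torchvision

ATTRS = ["attractive", "beard", "male", "makeup", "smiling", "young"]

model = torchvision.models.resnet18()                    # depth is an assumption
model.fc = torch.nn.Linear(model.fc.in_features, len(ATTRS))
criterion = torch.nn.BCEWithLogitsLoss()                 # independent binary attributes

def attribute_accuracy(logits, labels):
    """Per-attribute accuracy averaged over the batch (labels in {0, 1})."""
    preds = (torch.sigmoid(logits) > 0.5).float()
    return (preds == labels).float().mean(dim=0)         # one value per attribute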

1. https://cmusatyalab.github.io/openface/demo-2-comparison/


F. Performance on Different Poses

SymmFCNet may fail to exploit symmetry consistency when the face pose is far from frontal, i.e., when the yaw angle is not within [−60°, 60°]. We use Multi-PIE [47] images captured under the same lighting condition but with different poses to illustrate this pose limitation in Fig. 9. We also show the confidence map C predicted from the flow field in the second row, rendered as heat maps for better distinction. One can see that the confidence map is consistent with the pose variation: when the pose is large, the values in the confidence map are small, indicating that the pre-filling in Stage I is not fully reliable. Thus, by incorporating the both-half mask (Mb-h) with the confidence map, we can still take Ms2 as the input of Stage II. Finally, it is noted that neither our SymmFCNet nor the competing methods can generate plausible results for profile images, partially due to the lack of profile images in the training sets.
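
One plausible way to read the mask construction above is sketched below: pixels whose flow confidence falls under a threshold are treated as still missing when forming the Stage II mask, so RecNet re-synthesizes them rather than trusting the pre-filled values. The threshold and the exact combination rule are assumptions, not the paper's stated formula.

# Hedged sketch (one possible reading of the text): fold the flow confidence
# map into the Stage II mask so that unreliable pre-filled pixels are redone.
import torch

def build_stage2_mask(mask_both_half, mask_one_half, confidence, tau=0.5):
    """All tensors (N,1,H,W); masks use 1 = missing. tau is an assumed threshold."""
    unreliable = (confidence < tau).float() * mask_one_half
    return torch.clamp(mask_both_half + unreliable, 0.0, 1.0)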

V. CONCLUSION

This work presented a symmetry-consistent CNN model, i.e., SymmFCNet, for effective face completion. Based on the nearly reflectional symmetry of face images, we note that there are three types of pixels in face regions. For missing pixels on only one half-face, which can be filled from their symmetric counterparts, we first adopt a FlowNet to construct the correspondence between the two half-faces; the correspondence is then combined with a LightNet for illumination reweighting. Finally, for missing pixels on both half-faces, a RecNet is proposed, together with a perceptual symmetry loss, to recover photo-realistic facial details and structures. Extensive experiments show the effectiveness of SymmFCNet in symmetry correspondence construction and illumination reweighting, as well as in generating the final photo-realistic results. In terms of quantitative evaluation metrics, our SymmFCNet outperforms the state-of-the-arts by a large margin under irregular occlusion, and also achieves favorable performance under center occlusion. Moreover, incorporating symmetry into the face completion task benefits identity and facial attribute preservation, yielding clear improvements over the competing methods.

In the future, we will collect more facial images with different poses to make our model robust to extreme face poses. Moreover, face symmetry can also be readily extended to other face-related tasks, e.g., facial attribute editing, face frontalization, and face image restoration.

APPENDIX A
NETWORK ARCHITECTURE OF SYMMFCNET

Our SymmFCNet consists of four sub-networks, i.e., FlowNet, LightNet, MaskNet, and RecNet. FlowNet, LightNet, and MaskNet share the same architecture except for the channel number of the output. The architectures are shown in Table V and Table VI. Here, Conv (d, k, s) and Dilated Conv (d, k, s, r) denote the convolutional layer and the dilated convolutional layer, respectively, where d, k, s, and r represent the output dimension, kernel size, stride, and dilation rate. BN and SN denote batch normalization and spectral normalization, respectively.

TABLE V: Architectures of FlowNet, LightNet and MaskNet.

#   FlowNet                    | LightNet                | MaskNet
1   Input (6×256×256)          | Input (6×256×256)       | Input (3×256×256)

Rows 2–32 (shared by FlowNet, LightNet, and MaskNet):
2   Conv (32, 3, 1), LReLU
3   Conv (32, 3, 2), BN, LReLU
4   Conv (64, 3, 1), BN, LReLU
5   Conv (64, 3, 2), BN, LReLU
6   Conv (64, 3, 1), BN, LReLU
7   Conv (128, 3, 2), BN, LReLU
8   Conv (128, 3, 1), BN, LReLU
9   Conv (256, 3, 2), BN, LReLU
10  Conv (256, 3, 1), BN, LReLU
11  Conv (512, 3, 2), BN, LReLU
12  Conv (512, 3, 1), BN, LReLU
13  Dilated Conv (512, 3, 1, 1), LReLU
14  Dilated Conv (512, 3, 1, 2), LReLU
15  Dilated Conv (512, 3, 1, 3), LReLU
16  Dilated Conv (512, 3, 1, 4), LReLU
17  Concat (#13, 14, 15, 16), Conv (512, 3, 1)
18  Conv (512, 3, 1), BN, LReLU
19  Upsample (2), Conv (2, 3, 1), Tanh
20  Concat (#10, 19), Conv (256, 3, 1), LReLU
21  Conv (256, 3, 1), BN, LReLU
22  Upsample (2), Conv (2, 3, 1), Tanh
23  Concat (#8, 22), Conv (128, 3, 1), LReLU
24  Conv (128, 3, 1), BN, LReLU
25  Upsample (2), Conv (2, 3, 1), Tanh
26  Concat (#6, 25), Conv (64, 3, 1), LReLU
27  Conv (64, 3, 1), BN, LReLU
28  Upsample (2), Conv (2, 3, 1), Tanh
29  Concat (#4, 28), Conv (32, 3, 1), BN, LReLU
30  Conv (2, 3, 1), BN, LReLU
31  Upsample (2), Conv (2, 3, 1), Tanh
32  Upsample (2)

33  Conv (2, 3, 1), Tanh       | Conv (3, 3, 1), ReLU    | Conv (3, 3, 1), Sigmoid
34  Output (2×256×256)         | Output (3×256×256)      | Output (3×256×256)

Concat (i, j) indicates the concatenation from the (#i)-th layer to the (#j)-th layer via a skip connection. LReLU uses a negative slope of 0.2. Upsample (2) denotes 2× upsampling by bilinear interpolation.
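
For readers who want to translate a table row into layers, the helper below maps an entry such as "Conv (64, 3, 2), BN, LReLU" or "Dilated Conv (512, 3, 1, 2), LReLU" to PyTorch modules. The padding rule and the input-channel handling are our assumptions, since the tables only list output dimensions.

# Sketch of how one table row maps to layers; padding choices are assumptions.
import torch.nn as nn

def conv_block(in_ch, d, k, s, r=1, norm="bn"):
    """d: output dim, k: kernel size, s: stride, r: dilation (1 = plain conv)."""
    pad = r * (k - 1) // 2                      # keeps spatial size when s = 1
    layers = [nn.Conv2d(in_ch, d, k, stride=s, padding=pad, dilation=r)]
    if norm == "bn":
        layers.append(nn.BatchNorm2d(d))
    elif norm == "sn":                          # spectral norm wraps the conv itself
        layers[0] = nn.utils.spectral_norm(layers[0])
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

# e.g. Table V, row 11: conv_block(256, 512, 3, 2)
# e.g. Table V, row 14: conv_block(512, 512, 3, 1, r=2, norm=None)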

Our full model contains a total of 62.3 M parameters, including 9.9 M for LightNet, 9.9 M for FlowNet, 9.9 M for MaskNet, and 32.6 M for RecNet. In terms of running time, it takes 33.1 ms on average to process a 256×256 image in the inference phase, which is comparable to DFNet [2] (28.3 ms) and GMCNN [5] (29.5 ms), and faster than EdgeConnect [4] (51.3 ms), DeepFill [1] (200.0 ms), and PICNet [6] (308.5 ms for generating 10 candidate images).
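
The reported parameter counts and per-image timings can be reproduced with bookkeeping along these lines (shown for a generic single-input module; the full SymmFCNet pipeline takes several inputs, and the absolute numbers depend on hardware).

# Sketch of parameter counting and average inference-time measurement.
import time
import torch

def count_params_m(model):
    """Total number of parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def avg_inference_ms(model, runs=100, device="cuda"):
    """Average forward time (ms) for a 256x256 input on a generic module."""
    x = torch.randn(1, 3, 256, 256, device=device)
    model = model.to(device).eval()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - t0) * 1000 / runs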

REFERENCES

[1] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5505–5514.

[2] X. Hong, P. Xiong, R. Ji, and H. Fan, “Deep fusion network for image completion,” in ACM International Conference on Multimedia, 2019, pp. 2033–2042.

[3] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” in European Conference on Computer Vision, 2018, pp. 85–100.

[4] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi, “Edgeconnect: Structure guided image inpainting using edge prediction,” in IEEE International Conference on Computer Vision Workshops, 2019.

[5] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, “Image inpainting via generative multi-column convolutional neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 331–340.



Fig. 9: Comparisons on images with different types of occlusions and poses (from left to right: both-half and one-half occlusion; compared methods include Yeh et al. [8], GFCNet [7], DeepFill [1], DFNet [2], PConv [3], EdgeConnect [4], GMCNN [5], and PICNet [6]). One can observe that, with the incorporation of symmetry, our SymmFCNet generates more globally consistent and plausible results than the competing methods on faces with yaw poses within [−60°, 60°].


TABLE VI: Network architectures of RecNet.

# Layer   RecNet
1   Input (6×256×256)
2   Conv (64, 3, 2), SN, LReLU
3   ResBlock (Dilated Conv (64, 3, 1, 8), SN, LReLU)
4   ResBlock (Dilated Conv (64, 3, 1, 6), SN, LReLU)
5   Conv (128, 3, 2), SN, LReLU
6   ResBlock (Dilated Conv (128, 3, 1, 6), SN, LReLU)
7   ResBlock (Dilated Conv (128, 3, 1, 4), SN, LReLU)
8   Conv (256, 3, 2), SN, LReLU
9   ResBlock (Dilated Conv (256, 3, 1, 4), SN, LReLU)
10  ResBlock (Dilated Conv (256, 3, 1, 2), SN, LReLU)
11  Conv (512, 3, 2), SN, LReLU
12  ResBlock (Dilated Conv (512, 3, 1, 1), SN, LReLU)
13  ResBlock (Dilated Conv (512, 3, 1, 1), SN, LReLU)
14  Conv (1024, 3, 1), SN, LReLU
15  ResBlock (Dilated Conv (512, 3, 1, 1), SN, LReLU)
16  ResBlock (Dilated Conv (512, 3, 1, 1), SN, LReLU)
17  Dilated Conv (1024, 3, 1, 2), LReLU
18  Dilated Conv (1024, 3, 1, 3), LReLU
19  Dilated Conv (1024, 3, 1, 4), LReLU
20  Dilated Conv (1024, 3, 1, 5), LReLU
21  Concat (#16, 20), Conv (1024, 3, 1), SN, LReLU, ResBlock
22  Upsample (2), Concat (#10, 21), Conv (512, 3, 1), SN, LReLU, ResBlock
23  Upsample (2), Concat (#7, 22), Conv (256, 3, 1), SN, LReLU, ResBlock
24  Upsample (2), Conv (128, 3, 1), SN, LReLU, ResBlock
25  Upsample (2), Conv (64, 3, 1), SN, LReLU, ResBlock
26  Conv (3, 3, 1), Tanh
27  Output (3×256×256)


[6] C. Zheng, T.-J. Cham, and J. Cai, “Pluralistic image completion,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1438–1447.

[7] Y. Li, S. Liu, J. Yang, and M.-H. Yang, “Generative face completion,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3911–3919.

[8] R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5485–5493.

[9] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C. C. J. Kuo, “Contextual-based image inpainting: Infer, match, and translate,” in European Conference on Computer Vision, 2018, pp. 3–19.

[10] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, “Shift-net: Image inpainting via deep feature rearrangement,” in European Conference on Computer Vision, 2018, pp. 1–17.

[11] Y. Zeng, J. Fu, H. Chao, and B. Guo, “Learning pyramid-context encoder network for high-quality image inpainting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1486–1494.

[12] W. Xiong, J. Yu, Z. Lin, J. Yang, X. Lu, C. Barnes, and J. Luo, “Foreground-aware image inpainting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5840–5848.

[13] M.-c. Sagong, Y.-g. Shin, S.-w. Kim, S. Park, and S.-j. Ko, “Pepsi: Fast image inpainting with parallel decoding network,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11360–11368.

[14] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–14, 2017.

[15] L. Song, J. Cao, L. Song, Y. Hu, and R. He, “Geometry-aware face completion and editing,” in AAAI Conference on Artificial Intelligence, 2019, pp. 2506–2513.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[17] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.

[18] R. Dovgard and R. Basri, “Statistical symmetric shape from shading for 3d structure recovery of faces,” in European Conference on Computer Vision, 2004, pp. 99–113.

[19] Y.-J. Song, Y.-G. Kim, U.-D. Chang, and H. B. Kwon, “Face recognition robust to left/right shadows; facial symmetry,” Pattern Recognition, vol. 39, no. 8, pp. 1542–1545, 2006.

[20] J. Harguess and J. K. Aggarwal, “Is there a connection between face symmetry and face recognition?” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 66–73.

[21] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.

[22] C. Xie, S. Liu, C. Li, M.-M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding, “Image inpainting with learnable bidirectional attention maps,” in IEEE International Conference on Computer Vision, 2019, pp. 8858–8867.

[23] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6721–6729.

[24] R. He, J. Cao, L. Song, Z. Sun, and T. Tan, “Adversarial cross-spectral face completion for nir-vis face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 5, pp. 1025–1037, 2019.

[25] Y. Zhao, W. Chen, J. Xing, X. Li, Z. Bessinger, F. Liu, W. Zuo, and R. Yang, “Identity preserving face completion for large ocular region occlusion,” in British Machine Vision Conference, 2018.

[26] Y. Liu, H. Hel-Or, C. S. Kaplan, L. Van Gool et al., “Computational symmetry in computer vision and computer graphics,” Foundations and Trends® in Computer Graphics and Vision, 2010.

[27] C. Funk, S. Lee, M. R. Oswald, S. Tsogkas, W. Shen, A. Cohen, S. Dickinson, and Y. Liu, “2017 ICCV challenge: Detecting symmetry in the wild,” in IEEE International Conference on Computer Vision Workshops, 2017, pp. 1692–1701.

[28] G. Passalis, P. Perakis, T. Theoharis, and I. A. Kakadiaris, “Using facial symmetry to handle pose variations in real-world 3d face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1938–1951, 2011.

[29] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” in IEEE International Conference on Computer Vision, 2017, pp. 2439–2448.

[30] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” in IEEE International Conference on Computer Vision, 2017, pp. 1021–1030.

[31] Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky, “Deepwarp: Photorealistic image resynthesis for gaze manipulation,” in European Conference on Computer Vision, 2016, pp. 311–326.

[32] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala, “Semantic facial expression editing using autoencoded flow,” arXiv preprint arXiv:1611.09961, 2016.

[33] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in European Conference on Computer Vision, 2016, pp. 286–301.

[34] X. Li, M. Liu, Y. Ye, W. Zuo, L. Lin, and R. Yang, “Learning warped guidance for blind face restoration,” in European Conference on Computer Vision, 2018, pp. 272–289.

[35] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations, 2016.



[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[37] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations, 2018.

[38] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016, pp. 694–711.

[39] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, 2015.

[40] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in International Conference on Machine Learning, 2019.

[41] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.

[42] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in International Conference on Learning Representations, 2018.

[43] B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, “Learning data augmentation strategies for object detection,” arXiv preprint arXiv:1906.11172, 2019.

[44] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.

[45] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.

[46] B. Amos, B. Ludwiczuk, M. Satyanarayanan et al., “Openface: A general-purpose face recognition library with mobile applications,” CMU-CS-16-118, CMU School of Computer Science, Tech. Rep., 2016.

[47] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.

Xiaoming Li received the B.Sc. and M.Sc. degrees from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2013 and 2016, respectively. He is currently a Ph.D. student under the joint Ph.D. programme of the School of Computer Science and Technology, Harbin Institute of Technology, and the Department of Computing, The Hong Kong Polytechnic University. His research interests include image editing and restoration.

Guosheng Hu is an honorary lecturer at Queen's University Belfast and a Senior Researcher at AnyVision. He was a postdoctoral researcher in the LEAR team, Inria Grenoble Rhone-Alpes, France, from May 2015 to May 2016. He finished his Ph.D. at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK, in June 2015. His research interests include deep learning, pattern recognition, biometrics (mainly face recognition), and graphics.

Jieru Zhu received the B.Sc. in Bioinformatics from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2019. She is currently a Master's student in the School of Computer Science and Technology, Harbin Institute of Technology. Her current interests include image restoration and medical image segmentation.

Wangmeng Zuo (M'09-SM'14) received the Ph.D. degree in computer application technology from the Harbin Institute of Technology, Harbin, China, in 2007. From 2004 to 2006, he was a Research Assistant with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong. From 2009 to 2010, he was a Visiting Professor with Microsoft Research Asia. He is currently a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. He has published over 100 papers in top-tier academic journals and conferences. His current research interests include image enhancement and restoration, image/video generation, and image classification. He has served as a Tutorial Organizer in ECCV 2016, an Associate Editor of IET Biometrics, and a Senior Editor of the Journal of Electronic Imaging.

Meng Wang (SM'17) received the B.E. degree and the Ph.D. degree in the Special Class for the Gifted Young from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently a Professor with the Hefei University of Technology, Hefei, China. His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored more than 200 book chapters, journal, and conference papers in these areas. He is the recipient of the ACM SIGMM Rising Star Award 2014. He is an Associate Editor of IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT), IEEE Transactions on Multimedia (IEEE TMM), and IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS).

Lei Zhang (M'04, SM'14, F'18) received his B.Sc. degree in 1995 from Shenyang Institute of Aeronautical Engineering, Shenyang, P.R. China, and M.Sc. and Ph.D. degrees in Control Theory and Engineering from Northwestern Polytechnical University, Xi'an, P.R. China, in 1998 and 2001, respectively. From 2001 to 2002, he was a research associate in the Department of Computing, The Hong Kong Polytechnic University. From January 2003 to January 2006, he worked as a Postdoctoral Fellow in the Department of Electrical and Computer Engineering, McMaster University, Canada. In 2006, he joined the Department of Computing, The Hong Kong Polytechnic University, as an Assistant Professor. Since July 2017, he has been a Chair Professor in the same department. His research interests include computer vision, image and video analysis, pattern recognition, and biometrics. Prof. Zhang has published more than 200 papers in these areas. As of 2019, his publications have been cited more than 46,000 times in the literature. Prof. Zhang is a Senior Associate Editor of IEEE Trans. on Image Processing, and is/was an Associate Editor of IEEE Trans. on Pattern Analysis and Machine Intelligence, SIAM Journal of Imaging Sciences, IEEE Trans. on CSVT, and Image and Vision Computing. He is a "Clarivate Analytics Highly Cited Researcher" from 2015 to 2019. More information can be found on his homepage http://www4.comp.polyu.edu.hk/~cslzhang.