maintaining natural image statistics with the contextual...

Maintaining Natural Image Statistics with theContextual Loss

Roey Mechrez?, Itamar Talmi?, Firas Shama, Lihi Zelnik-Manor

The Technion - Israel{roey@campus,titamar@campus,shfiras@campus,lihi@ee}.technion.ac.il

Fig. 1. We demonstrate the advantages of using a statistical loss for training aCNN in: (left) single-image super resolution, and, (right) surface normal estima-tion. Our approach is easy to train and yields networks that generate images (orsurfaces) that exhibit natural internal statistics.

Abstract. Maintaining natural image statistics is a crucial factor inrestoration and generation of realistic looking images. When trainingCNNs, photorealism is usually attempted by adversarial training (GAN),that pushes the output images to lie on the manifold of natural images.GANs are very powerful, but not perfect. They are hard to train andthe results still often suffer from artifacts. In this paper we propose acomplementary approach, that could be applied with or without GAN,whose goal is to train a feed-forward CNN to maintain natural inter-nal statistics. We look explicitly at the distribution of features in animage and train the network to generate images with natural featuredistributions. Our approach reduces by orders of magnitude the numberof images required for training and achieves state-of-the-art results onboth single-image super-resolution, and high-resolution surface normalestimation.

? indicate authors contributed equally

arX

iv:1

803.

0462

6v3

[cs

.CV

] 1

8 Ju

l 201

8

2 Roey Mechrez*, Itamar Talmi*, Firas Shama, Lihi Zelnik-Manor

1 Introduction

“Facts are stubborn things, but statistics are pliable.” — Mark Twain

Maintaining natural image statistics has been known for years as a key factorin the generation of natural looking images [1,2,3,4,5]. With the rise of CNNs,the utilization of explicit image priors was replaced by Generative AdversarialNetworks (GAN) [6], where a corpus of images is used to train a network to gener-ate images with natural characteristics, e.g., [7,8,9,10]. Despite the use of GANswithin many different pipelines, results still sometimes suffer from artifacts, bal-anced training is difficult to achieve and depends on the architecture [8,11].

Well before the CNN era, natural image statistics were obtained by utilizingpriors on the likelihood of the patches of the generated images [2,5,12]. Zoran andWeiss [5] showed that such statistical approaches, that harness priors on patches,lead in general to restoration of more natural looking images. A similar conceptis in the heart of sparse-coding where a dictionary of visual code-words is usedto constrain the generated patches [13,14]. The dictionary can be thought as aprior on the space of plausible image patches. A related approach is to constrainthe generated image patches to the space of patches specific to the image to berestored [15,16,17,18]. In this paper we want to build on these ideas in orderto answer the following question: Can we train a CNN to generate images thatexhibit natural statistics?

The approach we propose is a simple modification to the common practice.As typically done, we as well train CNNs on pairs of source-target images byminimizing an objective that measures the similarity between the generatedimage and the target. We extend on the common practice by proposing to usean objective that compares the feature distributions rather than just comparingthe appearance. We show that by doing this the network learns to generate morenatural looking images.

A key question is what makes a suitable objective for comparing distribu-tions of features. The common divergence measures, such as Kullback-Leibler(KL), Earth-Movers-Distance and χ2, all require estimating the distributionof features, which is typically done via Multivariate Kernel-Density-Estimation(MKDE). Since in our case the features are very high dimensional, and sincewe want a differentiable loss that can be computed efficiently, MKDE is not anoption. Hence, we propose instead an approximation to KL that is both simpleto compute and tractable. As it turns out, our approximation coincides with therecently proposed Contextual loss [19], that was designed for comparing imagesthat are not spatially aligned. Since the Contextual is actually an approximationto KL, it could be useful as an objective also in applications where the imagesare aligned, and the goal is to generate natural looking images.

Since all we propose is utilizing the Contextual loss during training, our ap-proach is generic and can be adopted within many architectures and pipelines.In particular, it can be used in concert with GAN training. Hence, we chosesuper-resolution as a first test-case, where methods based on GANs are the cur-rent state-of-the-art. We show empirically, that using our statistical approach

Maintaining Natural Image Statistics with the Contextual Loss 3

Fig. 2. Training with a statistical loss: To train a generator network Gwe compare the output image G(s)≡ x with the corresponding target image yvia a statistical loss — the Contextual loss [19] — that compares their featuredistributions.

with GAN training outperforms previous methods, yielding images that are per-ceptually more realistic, while reducing the number of required training imagesby orders of magnitude.

Our second test-case further proves the generality of this approach to datathat is not images, and shows the strength of the statistical approach withoutGAN. Maintaining natural internal statistics has been shown to be an importantproperty also in the estimation of 3D surfaces [15,20,21]. Hence, we presentexperiments on surface normal estimation, where the network’s input is a high-resolution image and its output is a map of normals. We successfully generatenormal-maps that are more accurate than previous methods.To summarize, the contributions we present are three-fold:

1. We show that the Contextual loss [19] can be viewed as an approximationto KL divergence. This makes it suitable for reconstruction problems.

2. We show that training with a statistical loss yields networks that maintainthe natural internal statistics of images. This approach is easy to train andreduces significantly the training set size.

3. We present state-of-the-art results on both perceptual super-resolution andhigh-resolution surface normal estimation.

2 Training CNNs to Match Image Distributions

2.1 Setup

Our approach is very simple, as depicted in Figure 2. To train a generator net-work G we use pairs of source s and target y images. The network outputsan image G(s) ≡ x, that should be as similar as possible to the target y. To


measure the similarity we extract from both x and y dense features X = {xi}and Y ={yj}, respectively, e.g., by using pretrained network or vectorized RGBpatches. We denote by PY and PX the probability distribution functions over Yand X, respectively. When the patch set X correctly models Y then the underly-ing probabilities PY and PX are equal. Hence, we need to train G by minimizingan objective that measures the divergence between PX and PY .

2.2 Review of the KL-divergence

There are many common measures for the divergence between two distributions,e.g., χ2, Kullback-Leibler (KL), and Earth Movers Distance (EMD). We optedfor the KL-divergence in this paper.

The KL-divergence between the two densities PX and PY is defined as:

KL(PX ||PY ) =

∫PX log

PX

PY(1)

It computes the expectation of the logarithmic difference between the probabil-ities PX and PY , where the expectation is taken using the probabilities PX . Wecan rewrite this formula in terms of expectation:

KL(PX ||PY ) = E[

logPX − logPY

](2)

This requires approximating the parametrized densities PX , PY from the fea-ture sets X = {xi} and Y = {yj}. The most common method for doing thisis to use Multivariate Kernel Density Estimation (MKDE), that estimates theprobabilities PX(pk) and PY (pk) as:

PX(pk) =∑xi∈X

KH(pk, xi) PY (pk) =∑yj∈Y

KH(pk, yj) (3)

Here KH(pk, ξ) is an affinity measure (the kernel) between some point pk andthe samples ξ, typically taken as standard multivariate normal kernel KH(z) =(2π)−d/2|H|−1/2 exp(− 1

2zTHz), with z = dist(pk, ξ), d is the dimension and H is

a d×d bandwidth matrix. We can now write the expectation of the KL-Divergenceover the MKDE with uniform sample grid as follows:

KL(PX ||PY ) =1

N

∑pk

[logPX(pk)− logPY (pk)

](4)

A common simplification that we adopt is to compute the MKDE not over aregular grid of {pk} but rather on the samples {xi} directly, i.e., we set pk =xk,which implies KH(pk, xi) = KH(xk, xi) and KH(pk, yj) = KH(xk, yj). Puttingthis together with Eq. (3) into Eq. (4) yields:

KL(PX ||PY ) =1

N

∑k

[log

∑xi∈X

KH(xk, xi)− log∑yj∈Y

KH(xk, yj)]

(5)

To use Eq. (5) as a loss function we need to choose a kernel KH . The mostcommon choice of a standard multivariate normal kernel requires setting thebandwidth matrix H, which is non-trivial. Multivariate KDE is known to besensitive to the optimal selection of the bandwidth matrix and the existing solu-tions for tuning it require solving an optimization problem which is not possibleas part of a training process. Hence, we next propose an approximation, whichis both practical and tractable.


2.3 Approximating the KL-divergence with the Contextual loss

Our approximation is based on one further observation. It is insufficient to pro-duce an image with the same distribution of features, but rather we want addi-tionally that the samples will be similar. That is, we ask each point xi ∈ X tobe close to a specific yj ∈ Y .

To achieve this, while assuming that the number of samples is large, we setthe MKDE kernel such that it approximates a delta function. When the kernelis a delta, the first log term of Eq. (5) becomes a constant since:

KH(xk, xi) =

{≈ 1 if k = i

≈ 0 otherwise(6)

The kernel in the second log term of Eq. (5) becomes:

KH(xk, yj) =

{≈ 1 if dist(xk, yj)� dist(xk, yl) ∀l 6= j

≈ 0 otherwise(7)

We can thus simplify the objective of Eq. (5):

E(x, y) = − log

(1

N

∑j

maxiAij

)(8)

where we denote Aij = KH(xk, yj). Next, we suggest two alternatives for thekernel Aij , and show that one implies that the objective of Eq. (8) is equivalentto the Chamfer Distance [22], while the other implies it is equivalent to theContextual loss of [19].

The Contextual Loss: As it turns out, the objective of Eq. (8) is identical tothe Contextual loss recently proposed by [19]. Furthermore, they set the kernelAij to be close to a delta function, such that it fits Eq. (7). First the Cosine (orL2) distances dist(xi, yj) are computed between all pairs xi,yj . The distances

are then normalized: dij = dist(xi, yj)/(mink dist(xi, yk) + ε) (with ε = 1e−5),and finally the pairwise affinities Aij ∈ [0, 1] are defined as:

Aij =exp

(1− dij/h

)∑

l exp(

1− dil/h) =

{≈ 1 if dij� dil ∀l 6= j

≈ 0 otherwise(9)

where h>0 is a scalar bandwidth parameter that we fix to h=0.1, as proposedin [19]. When using these affinities our objective equals the Contextual loss of [19]and we denote E(x, y) = LCX(x, y).


Affinity value (a) The Contextual loss [19] (b) Chamfer distance [22]

Fig. 3. The Contextual loss vs. Chamfer Distance: We demonstrate via a2D example the difference between two approximations to the KL-divergence (a)The Contextual loss and (b) Chamfer Distance. Point sets Y and X are markedwith blue and orange squares respectively. The colored lines connect each yj withxi with the largest affinity. The KL approximation of Eq. (8) sums over theseaffinities. It can be seen that the normalized affinities used by the Contextualloss lead to more diverse and meaningful matches between Y and X.

The Chamfer Distance A simpler way to set Aij is to take a Gaussian kernelwith a fixed H = I s.t. Aij = exp(−dist(xi, yj)). This choice implies that mini-mizing Eq. (8) is equivalent to minimizing the asymmetric Chamfer Distance [22]between X,Y defined as

CD(X,Y ) =1

|X|∑i

minjdist(xi, yj) (10)

CD has been previously used mainly for shape retrieval [23,24] were the pointsare in R3. For each point in set X, CD finds the nearest point in the set Y andminimizes the sum of these distances. A downside of this choice for the kernelAij is that it does not satisfy the requirement in Eq. (7).

The Contextual Loss vs. Chamfer Distance To provide intuition on thedifference between the two choices of affinity functions we present in Figure 3 anillustration in 2D. Recall, that both the Contextual loss LCX and the Chamferdistance CD find for each point in Y a single match in X, however, these matchesare computed differently. CD selects the closet point, hence, we could get thatmultiple points in Y are matched to the same few points in X. Differently, LCX

computes normalized affinities, that consider the distances between every pointxi to all the points yj ∈Y . Therefore, it results in more diverse matches betweenthe two sets of points, and provides a better measure of similarity between thetwo distributions. Additional example is presented in the supplementary.

Training with LCX guides the network to match between the two point sets,and as a result the underlying distributions become closer. In contrast, trainingwith CD, does not allow the two sets to get close. Indeed, we found empiri-cally that training with CD does not converge, hence, we excluded it from ourempirical reports.


Fig. 4. Minimizing the KL-divergence: The curves represents seven mea-sures of dissimilarity between the patch distribution of the target image y andthat of the generated image x, during training a super-resolution network (de-tails in supplementary) and using LCX as the objective. The correlation betweenLCX and all other measures, reported in square brackets, show highest correla-tion to the KL-divergence. It is evident that training with the Contextual lossresults in minimizing also the KL-divergence.

3 Empirical analysis

The Contextual loss has been proposed in [19] for measuring similarity betweennon-aligned images. It has therefore been used for applications such as styletransfer, where the generated image and the target style image are very differ-ent. In the current study we assert that the Contextual loss can be viewed asa statistical loss between the distributions of features. We further assert thatusing such a loss during training would lead to generating images with realisticcharacteristics. This would make it a good candidate also for tasks where thetraining image pairs are aligned.

To support these claims we present in this section two experiments, and inthe next section two real applications. The first experiment shows that mini-mizing the Contextual loss during training indeed implies also minimization ofthe KL-divergence. The second experiment evaluates the relation between theContextual loss and human perception of image quality.

Empirical analysis of the approximation Since we are proposing to usethe Contextual loss as an approximation to the KL-divergence, we next showempirically, that minimizing it during training also minimizes the KL-divergence.

To do this we chose a simplified super-resolution setup, based on SRRes-Net [9], and trained it with the Contextual loss as the objective. The details onthe setup are provided in the supplementary. We compute during the training theContextual loss, the KL-divergence, as well as five other common dissimilaritymeasures. To compute the KL-divergence, EMD and χ2, we need to approxi-mate the density of each image. As discussed in Section 2, it is not clear how themultivariate solution to KDE can be smoothly used here, therefore, instead, wegenerate a random projection of all 5×5 patches onto 2D and fit them using KDE


Fig. 5. How perceptual is the Contextual loss? Test on the 2AFC [26]data set with traditional and CNN-based distortions. We compare between low-level metrics (such as SSIM) and off-the-shelf deep networks trained with L2distance, Chamfer distance (CD) and the Contextual loss (LCX). LCX yields aperformance boost across all three networks with the largest gain obtained forVGG – the most common loss network (i.e., perceptual loss).

with a Gaussian kernel in 2D [25]. This was repeated for 100 random projectionsand the scores were averaged over all projections (examples of projections areshown in the supplementary).

Figure 4 presents the values of all seven measures during the iterations ofthe training. It can be seen that all of them are minimized during the itera-tions. The KL-divergence minimization curve is the one most similar to thatof the Contextual loss, suggesting that the Contextual loss forms a reasonableapproximation.

Evaluating on Human Perceptual Judgment Our ultimate goal is to trainnetworks that generate images with high perceptual quality. The underlyinghypothesis behind our approach is that training with an objective that maintainsnatural statistics will lead to this goal. To asses this hypothesis we repeatedthe evaluation procedure proposed in [26], for assessing the correlation betweenhuman judgment of similarity and loss functions based on deep features.

In [26] it was suggested to compute the similarity between two images bycomparing their corresponding deep embeddings. For each image they obtaineda deep embedding via a pre-trained network, normalized the activations and thencomputed the L2 distance. This was then averaged across spatial dimension andacross all layers. We adopted the same procedure while replacing the L2 distancewith the Contextual loss approximation to KL.

Our findings, as reported in Figure 5 (the complete table is provided inthe supplementary material), show the benefits of our proposed approach. TheContextual loss between deep features is more closely correlated with humanjudgment than L2 or Chamfer Distance between the same features. All of theseperceptual measures are preferable over low-level similarity measures such asSSIM [27].


4 Applications

In this section we present two applications: single-image super-resolution, andhigh-resolution surface normal estimation. We chose the first to highlight theadvantage of using our approach in concert with GAN. The second was selectedsince its output is not an image, but rather a surface of normals. This shows thegeneric nature of our approach to other domains apart from image generation,where GANs are not being used.

4.1 Single-Image Super-Resolution

To asses the contribution of our suggested framework for image restoration weexperiment on single-image super-resolution. To place our efforts in context westart by briefly reviewing some of the recent works on super-resolution. A morecomprehensive overview of the current trends can be found in [28].

Recent solutions based on CNNs can be categorized into two groups. Thefirst group relies on the L2 or L1 losses [9,29,30], which lead to high PSNRand SSIM [27] at the price of low perceptual quality, as was recently shownin [31,26]. The second group of works aim at high perceptual quality. This isdone by adopting perceptual loss functions [32], sometimes in combination withGAN [9] or by adding the Gram loss [33], which nicely captures textures [34].

Proposed solution: Our main goal is to generate natural looking images,with natural internal statistics. At the same time, we do not want the structuralsimilarity to be overly low (the trade-off between the two is nicely analyzedin [31,26]). Therefore, we propose an objective that considers both, with higherimportance to perceptual quality. Specifically, we integrate three loss terms: (i)The Contextual loss – to make sure that the internal statistics of the generatedimage are similar to those of the ground-truth high-resolution image. (ii) TheL2 loss, computed at low resolution – to drive the generated image to sharethe spatial structure of the target image. (iii) Finally, following [9] we add anadversarial term, which helps in pushing the generated image to look “real”.

Given a low-resolution image s and a target high-resolution image y, ourobjective for training the network G is:

L(G) = λCX · LCX(G(s), y) + λL2 · ||G(s)LF− yLF||2 + λGAN · LGAN(G(s)) (11)

where in all our experiments λCX =0.1, λGAN =1e−3, and λL2 =10. The imagesG(s)LF, yLF are low-frequencies obtained by convolution with a Gaussian kernelof width 21×21 and σ= 3. For the Contextual loss feature extraction we usedlayer conv3 4 of VGG19 [35].

Implementation details: We adopt the SRGAN architecture [9]1 andreplace only the objective. We train it on just 800 images from the DIV2Kdataset [36], for 1500 epochs. Our network is initialized by first training usingonly the L2 loss for 100 epochs.

1 We used the implementation in https://github.com/tensorlayer/SRGAN

https://github.com/tensorlayer/SRGAN


MethodLoss Distortion Perceptual # Training

function SSIM [27] NRQM [38] images

Johnson[32] LP 0.631 7.800 10KSRGAN[9] LGAN+LP 0.640 8.705 300KEnhanceNet[34] LGAN+LP +LT 0.624 8.719 200KOurs full LGAN+LCX+LLF

2 0.643 8.800 800

Ours w/o LLF2 LGAN+LCX 0.67 8.53 800

Ours w/o LCX LGAN+LLF2 0.510 8.411 800

SRGAN-MSE* LGAN+L2 0.643 8.4 800*our reimplementation

Table 1. Super-resolution results: The table presents the mean scoresobtained by our method and three others on BSD100 [37]. SSIM [27] mea-sures structural similarity to the ground-truth, while NRQM [38] measures no-reference perceptual quality. Our approach provides an improvement on bothscores, even though it required orders of magnitude fewer images for training.The table further provides an ablation test of the loss function to show that LCX

is a key ingredient in the quality of the results. LP is the perceptual loss [32],LT is the Gram loss [33], and both use VGG19 features.

Evaluation: Empirical evaluation was performed on the BSD100 dataset [37].As suggested in [31] we compute both structural similarity (SSIM [27]) to theground-truth and perceptual quality (NRQM [38]). The “ideal” algorithm willhave both scores high.

Table 1 compares our method with three recent solutions whose goal is highperceptual quality. It can be seen that our approach outperforms the state-of-the-art on both evaluation measures. This is especially satisfactory as we neededonly 800 images for training, while previous methods had to train on tens or evenhundreds of thousands of images. Note, that the values of the perceptual measureNRQM are not normalized and small changes are actually quite significant. Forexample the gap between [9] and [34] is only 0.014 yet visual inspection shows asignificant difference. The gap between our results and [34] is 0.08, i.e., almost6 times bigger, and visually it is significant.

Figure 6 further presents a few qualitative results, that highlight the gapbetween our approach and previous ones. Both SRGAN [9] and EnhanceNet [34]rely on adversarial training (GAN) in order to achieve photo-realistic results.This tends to over-generate high-frequency details, which make the image looksharp, however, often these high-frequency components do not match those ofthe target image. The Contextual loss, when used in concert with GAN, reducesthese artifacts, and results in natural looking image patches.

An interesting observation is that we achieve high perceptual quality while us-ing for the Contextual loss features from a mid-layer of VGG19, namely conv3 4.This is in contrast to the reports in [9] that had to use for SRGAN the high-layer conv5 4 for the perceptual loss (and failed when using low-layer such asconv2 2). Similarly, EnhanceNet required a mixture of pool2 and pool5.


Bicubic EnhanceNet[34] SRGAN[9] ours HR

Fig. 6. Super-resolution qualitative evaluation: Training with GAN and aperceptual loss sometimes adds undesired textures, e.g., SRGAN added wrinklesto the girl’s eyelid, unnatural colors to the zebra stripes, and arbitrary patternsto the ear. Our solution replaces the perceptual loss with the Contextual loss,thus getting rid of the unnatural patterns.

4.2 High-resolution Surface Normal Estimation

The framework we propose is by no means limited to networks that generateimages. It is a generic approach that could be useful for training networks forother tasks, where the natural statistics of the target should be exhibited inthe network’s output. To support this, we present a solution to the problem ofsurface normal estimation – an essential problem in computer vision, widely usedfor scene understanding and new-view synthesis.

Data: The task we pose is to estimate the underlying normal map from a singlemonocular color image. Although this problem is ill-posed, recent CNN basedapproaches achieve satisfactory results [40,41,42,43] on the NYU-v2 dataset [44].Thanks to its size, quality and variety, NYU-v2 is intensively used as a test-bedfor predicting depth, normals, segmentation etc. However, due to the acquisitionprotocol, its data does not include the fine details of the scene and misses theunderlying high frequency information, hence, it does not provide a detailedrepresentation of natural surfaces.

Therefore, we built a new dataset of images of surfaces and their correspond-ing normal maps, where fine details play a major role in defining the surfacestructure. Examples are shown in Figure 7. Our dataset is based on 182 dif-ferent textures and their respective normal maps that were collected from theInternet2, originally offered for usages of realistic interior home design, gaming,

2 www.poliigon.com and www.textures.com

www.poliigon.com

www.textures.com


Texture image Ground-truth CRN Ours (w/ L1) Ours (w/o L1)

Fig. 7. Estimating surface normals: A visual comparison of our method withCRN [39]. CRN trains with L1 as loss, which leads to coarse reconstruction thatmisses many fine details. Conversely, using our objective, especially without L1,leads to more natural looking surfaces, with delicate texture details. Interestingly,the results without L1 look visually better, even though quantitatively they areinferior, as shown in Table 2. We believe this is due to the usage of structuralevaluation measures rather than perceptual ones.

arts, etc. For each texture we obtained a high resolution color image (1024×1024)and a corresponding normal-map of a surface such that its underlying plane nor-mal points towards the camera (see Figure 7). Such image-normals pairs lack theeffect of lighting, which plays an essential role in normal estimation. Hence, weused Blender3, a 3D renderer, to simulate each texture under 10 different point-light locations. This resulted in a total of 1820 pairs of image-normals. Thetextures were split into 90% for training and 10% for testing, such that the testset includes all the 10 rendered pairs of each included texture.

The collection offers a variety of materials including wood, stone, fabric, steel,sand, etc., with multiple patterns, colors and roughness levels, that capture theappearance of real-life surfaces. While some textures are synthetic, they lookrealistic and exhibit imperfections of real surfaces, which are translated into thefine-details of both the color image as well as its normal map.

Proposed solution: We propose using an objective based on a combinationof three loss terms: (i) The Contextual loss – to make sure that the internalstatistics of the generated normal map match those of the target normal map.(ii) The L2 loss, computed at low resolution, and (iii) The L1 loss. Both drivethe generated normal map to share the spatial layout of the target map. Ouroverall objective is:

L(G) = λCX · LCX(G(s), y) + λL2 · ||G(s)LF − yLF||2 + λL1 · ||G(s)− y||1 (12)

3 www.blender.org

www.blender.org


where, λCX =1, and λL2 =0.1. The normal-maps G(s)LF, yLF are low-frequenciesobtained by convolution with a Gaussian kernel of width 21×21 and σ = 3. Wetested with both λL1 =1 and λL1 =0, which removes the third term.

Implementation details: We chose as architecture the Cascaded Refine-ment Network (CRN) [39] originally suggested for label-to-image and was shownto yield great results in a variety of other tasks [19]. For the contextual loss wetook as features 5×5 patches of the normal map (extracted with stride 2) andlayers conv1 2, conv2 2 of VGG19. In our implementation we reduced memoryconsumption by random sampling of all three layers into 65×65 features.

Evaluation: Table 2 compares our results with previous solutions. Wecompare to the recently proposed PixelNet [40], that presented state-of-the-art results on NYU-v2. Since PixelNet was trained on NYU-v2, which lacksfine details, we also tried fine-tuning it on our dataset. In addition, we presentresults obtained with CRN [39]. While the original CRN was trained with bothL1 and the perceptual loss [32], this combination provided poor results on normalestimation. Hence, we excluded the perceptual loss and report results with onlyL1 as the loss, or when using our objective of Eq. (12). It can be seen that CRNtrained with our objective (with L1) leads to the best quantitative scores.

Mean Median RMSE 11.25◦ 22.5◦ 30◦

Method (◦)↓ (◦)↓ (◦)↓ (%)↑ (%)↑ (%)↑PixelNet [40] 25.96 23.76 30.01 22.54 50.61 65.09PixelNet [40]+FineTune 14.27 12.44 16.64 51.20 85.13 91.81CRN [39] 8.73 6.57 11.74 74.03 90.96 95.28Ours (without L1) 9.67 7.26 12.94 70.39 89.84 94.73Ours (with L1) 8.59 6.50 11.54 74.61 91.12 95.28

Table 2. Quantitative evaluation of surface normal estimation (on ournew high-resolution dataset). The table presents six evaluation statistics, previ-ously suggested by [40,41], over the angular error between the predicted normalsand ground-truth normals. For the first three criteria lower is better, while forthe latter three higher is better. Our framework leads to the best results on allsix measures.

We would like to draw your attention to the inferior scores obtained whenremoving the L1 term from our objective (i.e., setting λL1 = 0 in Eq. (12)).Interestingly, this contradicts what one sees when examining the results visually.Looking at the reconstructed surfaces reveals that they look more natural andmore similar to the ground-truth without L1. A few examples illustrating this areprovided in Figures 7, 8. This phenomena is actually not surprising at all. In fact,it is aligned with the recent trends in super-resolution, discussed in Section 4.1,where perceptual evaluation measures are becoming a common assessment tool.Unfortunately, such perceptual measures are not the common evaluation criteriafor assessing reconstruction of surface normals.


CR

NO

urs

(wit

hL

1)

Ours

(w/oL

1)

GT

Fig. 8. Rendered images: To illustrate the benefits of our approach we rendernew view images using the estimated (or ground-truth) normal maps. Using ourframework better maintains the statistics of the original surface and thus resultswith more natural looking images that are more similar to the ground-truth.Again, as was shown in Figure 7, the results without L1 look more natural thanthose with L1.

Finally, we would like to emphasize, that our approach generalizes well toother textures outside our dataset. This can be seen from the results in Figure 1where two texture images, found online, were fed to our trained network. Thereconstructed surface normals are highly detailed and look natural.

5 Conclusions

In this paper we proposed using loss functions based on a statistical comparisonbetween the output and target for training generator networks. It was shown viamultiple experiments that such an approach can produce high-quality and state-of-the-art results. While we suggest adopting the Contextual loss to measure thedifference between distributions, other loss functions of a similar nature could(and should, may we add) be explored. We plan to delve into this in our futurework.Acknowledgements the Israel Science Foundation under Grant 1089/16 andby the Ollendorf foundation.

References

1. Ruderman, D.L.: The statistics of natural images. Network: computation in neuralsystems (1994) 2


2. Levin, A., Zomet, A., Weiss, Y.: Learning how to inpaint from global image statis-tics. In: ICCV. (2003) 2

3. Weiss, Y., Freeman, W.T.: What makes a good model of natural images? In:CVPR. (2007) 2

4. Roth, S., Black, M.J.: Fields of experts. IJCV (2009) 25. Zoran, D., Weiss, Y.: From learning models of natural image patches to whole

image restoration. In: ICCV, IEEE (2011) 26. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,

Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014) 27. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with

deep convolutional generative adversarial networks. In: ICLR. (2015) 28. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein gan. arXiv preprint

arXiv:1701.07875 (2017) 29. Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken,

A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR. (2017) 2, 7, 9,10, 11

10. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with con-ditional adversarial networks. In: CVPR. (2017) 2

11. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.:Improved techniques for training gans. In: NIPS. (2016) 2

12. Levin, A.: Blind motion deblurring using image statistics. In: Advances in NeuralInformation Processing Systems. (2007) 841–848 2

13. Aharon, M., Elad, M., Bruckstein, A.: K-svd: An algorithm for designing overcom-plete dictionaries for sparse representation. IEEE Transactions on signal processing(2006) 2

14. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorizationand sparse coding. Journal of Machine Learning Research (2010) 2

15. Hassner, T., Basri, R.: Example based 3d reconstruction from single 2d images.In: CVPR Workshop. (2006) 2, 3

16. Michaeli, T., Irani, M.: Blind deblurring using internal patch recurrence. In:ECCV. (2014) 2

17. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: ICCV.(2009) 2

18. Zontak, M., Irani, M.: Internal statistics of a single natural image. In: CVPR.(2011) 2

19. Mechrez, R., Talmi, I., Zelnik-Manor, L.: The contextual loss for image transfor-mation with non-aligned data. (2018) 2, 3, 5, 6, 7, 13

20. Gal, R., Shamir, A., Hassner, T., Pauly, M., Cohen-Or, D.: Surface reconstructionusing local shape priors. In: Symposium on Geometry Processing. (2007) 3

21. Huang, J., Lee, A.B., Mumford, D.: Statistics of range images. In: CVPR. (2000)3

22. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C., Wolf, H.C.: Parametric correspon-dence and chamfer matching: Two new techniques for image matching. Technicalreport (1977) 5, 6

23. Sun, X., Wu, J., Zhang, X., Zhang, Z., Zhang, C., Xue, T., Tenenbaum, J.B.,Freeman, W.T.: Pix3d: Dataset and methods for single-image 3d shape modeling.In: CVPR. (2018) 6

24. de Villiers, H.A., van Zijl, L., Niesler, T.R.: Vision-based hand pose estimationthrough similarity search using the earth mover’s distance. IET computer vision(2012) 6


25. Botev, Z.I., Grotowski, J.F., Kroese, D.P., et al.: Kernel density estimation viadiffusion. The annals of Statistics (2010) 8

26. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonableeffectiveness of deep features as a perceptual metric. (2018) 8, 9

27. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:from error visibility to structural similarity. IEEE transactions on image processing(2004) 8, 9, 10

28. Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L., et al.: Ntire2017 challenge on single image super-resolution: Methods and results. In: CVPRWorkshops. (2017) 9

29. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networksfor single image super-resolution. In: CVPR Workshops. (2017) 9

30. Tong, T., Li, G., Liu, X., Gao, Q.: Image super-resolution using dense skip con-nections. In: CVPR. (2017) 9

31. Blau, Y., Michaeli, T.: The perception-distortion tradeoff. In: CVPR. (2018) 9,10

32. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transferand super-resolution. In: ECCV. (2016) 9, 10, 13

33. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutionalneural networks. In: CVPR. (2016) 9, 10

34. Sajjadi, M.S., Scholkopf, B., Hirsch, M.: Enhancenet: Single image super-resolutionthrough automated texture synthesis. In: ICCV. (2017) 9, 10, 11

35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scaleimage recognition. arXiv preprint arXiv:1409.1556 (2014) 9

36. Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution:Dataset and study. In: CVPR Workshops. (2017) 9

37. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented naturalimages and its application to evaluating segmentation algorithms and measuringecological statistics. In: ICCV, IEEE (2001) 10

38. Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality met-ric for single-image super-resolution. Computer Vision and Image Understanding(2017) 10

39. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinementnetworks. In: ICCV. (2017) 12, 13

40. Bansal, A., Chen, X., Russell, B., Ramanan, A.G., et al.: Pixelnet: Representationof the pixels, by the pixels, and for the pixels. In: CVPR. (2016) 11, 13

41. Wang, X., Fouhey, D., Gupta, A.: Designing deep networks for surface normalestimation. In: CVPR. (2015) 11, 13

42. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels witha common multi-scale convolutional architecture. In: Proceedings of the IEEEInternational Conference on Computer Vision. (2015) 2650–2658 11

43. Chen, W., Xiang, D., Deng, J.: Surface normals in the wild. In: ICCV. (2017) 1144. Nathan Silberman, Derek Hoiem, P.K., Fergus, R.: Indoor segmentation and sup-

port inference from rgbd images. In: ECCV. (2012) 11

maintaining natural image statistics with the contextual...

Documents