
A perceptual quantization strategy for HEVC based on a convolutional neural network trained on natural images

Md Mushfiqul Alam*ᵃ, Tuan D. Nguyenᵃ, Martin T. Haganᵃ, and Damon M. Chandlerᵇ

ᵃSchool of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK, USA; ᵇDept. of Electrical and Electronic Engineering, Shizuoka University, Hamamatsu, Japan

ABSTRACT

Fast prediction models of local distortion visibility and local quality can potentially make modern spatiotemporally adaptive coding schemes feasible for real-time applications. In this paper, a fast convolutional-neural-network-based quantization strategy for HEVC is proposed. Local artifact visibility is predicted via a network trained on data derived from our improved contrast gain control model. The contrast gain control model was trained on our recent database of local distortion visibility in natural scenes [Alam et al., JOV 2014]. Furthermore, a structural facilitation model was proposed to capture the effects of recognizable structures on distortion visibility via the contrast gain control model. Our results provide, on average, an 11% improvement in compression efficiency for the spatial luma channel of HEVC while requiring almost one hundredth of the computational time of an equivalent gain control model. Our work opens the door to similar techniques that may work for forthcoming compression standards.

Keywords: HEVC, distortion visibility, convolutional neural network, natural images, visual masking

1. INTRODUCTION

Most digital videos have undergone some form of lossy coding, a technology which has enabled the transmission and storage of vast amounts of video at increasingly higher resolutions. A key attribute of modern lossy coding is the ability to trade off the resulting bit-rate against the resulting visual quality. The latest coding standard, high efficiency video coding (HEVC) [1], has emerged as an effective successor to H.264, particularly for high-quality videos. In this high-quality regime, an often-sought goal is to code each spatiotemporal region with the minimum number of bits required to keep the resulting distortions imperceptible. Achieving such a goal, however, requires the ability to accurately and efficiently predict the local visibility of coding artifacts, a goal which currently remains elusive.

The most common technique for estimating distortion visibility is to employ a model of just noticeable distortion (JND) [2-4], and such JND-type models have been used to design improved quantization strategies for use in image coding [5-8]. However, these approaches are based largely on findings from highly controlled psychophysical studies using unnatural videos. For example, Peterson et al. [9] performed an experiment to determine visual sensitivity to DCT basis functions in the RGB and YCrCb color spaces, and these results have been used to develop distortion visibility models [5, 10-12]. However, the presence of a video could induce visual masking, thus changing the visual sensitivity to the DCT basis functions.

Several researchers have also employed distortion visibility models specifically for HEVC and H.264. Naccari and Pereira proposed a just-noticeable-difference (JND) model for H.264 by considering visual masking in the luminance, frequency, pattern, and temporal domains [13]. However, the coding scheme was based on Foley and Boynton's [14] contrast gain-control model, which did not consider the gain control in V1 neurons. In another study, Naccari et al. proposed a quantization scheme for HEVC to capture luminance masking based on measurements of pixel intensities [15]. Blasi et al. proposed a quantization scheme for HEVC which took visual masking into account by selectively discarding frequency components of the residual signal depending on the input patterns [16]. However, that frequency-channel discarding scheme was not based on psychophysical measurements. Other researchers, for example Yu et al. [17] and Li et al. [18], have used saliency information, rather than visual masking, to achieve higher compression for HEVC.

*[email protected]; phone: 1 718 200-8502; vision.okstate.edu/masking

Applications of Digital Image Processing XXXVIII, edited by Andrew G. Tescher, Proc. of SPIE Vol. 9599, 959918 · © 2015 SPIE · CCC code: 0277-786X/15/$18

doi: 10.1117/12.2188913

Thus, one potential area for improved coding is to develop and employ a visibility model which is not just designed for naturalistic masks (natural videos), but is also designed to adapt to individual video characteristics (see, e.g., Refs. 19-23, which aim toward this adaptive approach for images). Another area for improvement is runtime performance. Operating full distortion visibility models in a block-based fashion, so as to achieve local predictions, is often impractical due to prohibitive computational and memory demands. Thus, predicting local distortion visibility in a manner which is simultaneously more accurate and more efficient than existing models remains an open research challenge.

Here, we propose a quantization scheme for HEVC based on a fast model of local artifact visibility designed specifically for natural videos. The model uses a convolutional-neural-network (CNN) architecture to predict the local artifact visibility and quality of each spatiotemporal region of a video. We trained the CNN model using our recently published database of local masking and quality in natural scenes [24, 25]. When coupled with an HEVC encoder, our approach provided 11% more compression over baseline HEVC when matched in visual quality, while requiring one hundredth the computational time of an equivalent gain control model. Our work opens the door to similar techniques that may work for forthcoming compression standards.

2. DISTORTION VISIBILITY PREDICTION MODEL-1: CONTRAST GAIN CONTROL WITH STRUCTURAL FACILITATION (CGC+SF)

This paper describes two separate models for predicting distortion visibility: (1) a model of masking which operates by simulating V1 neural responses with contrast gain control (CGC); and (2) a model which employs a convolutional neural network (CNN). The training data for the CNN model are derived from the CGC model.

Contrast masking [26] has been widely used for predicting distortion visibility in images and videos [5, 11, 12, 27].

Among the many existing models of contrast masking, those which simulate the contrast gain-control response properties of V1 neurons are most widely used. Although several gain control models have been proposed in previous studies (e.g., Refs. 26, 28-31), in most cases the model parameters are selected based on results obtained using either unnatural masks [31] or only a very limited number of natural images [20, 22]. Here, we have used the contrast masking model proposed by Watson and Solomon [31]. This model is well-suited for use as a generic distortion visibility predictor because: (1) the model mimics V1 neurons and is therefore theoretically distortion-type-agnostic; (2) the model parameters can be tuned to fall within biologically plausible ranges; (3) the model can directly take images, rather than image features, as inputs; and (4) the model can operate in a spatially localized fashion.

We improved upon the CGC model in two ways. First, we optimized the CGC model using our recently developed dataset of local masking in natural scenes [24] (currently, the largest dataset of its kind). Second, we incorporated into the CGC model a structural facilitation model which better captures the reduced masking observed in structured regions.

2.1 Watson-Solomon [31] contrast gain control (CGC) model

This subsection briefly describes the Watson and Solomon contrast gain-control model [31]. Figure 1 shows the model flow. Both the mask (image) and the mask+target (distorted image) go through the same stages: a spatial filter to simulate the contrast sensitivity function (CSF), a log-Gabor decomposition, excitatory and inhibitory nonlinearities, inhibitory pooling, and division [31]. After division, the subtracted responses of the mask and mask+target are pooled via Minkowski pooling [31]. The target contrast is changed iteratively until the Minkowski-pooled response equals a pre-defined "at threshold" decision (d), at which point the contrast of the target is deemed the final threshold.
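
This iterative search can be sketched as follows. This is a minimal bisection variant, assuming a black-box pooled_response(mask, target_contrast) function that stands in for the full CSF/log-Gabor/gain-control pipeline and increases monotonically with target contrast; the bounds and tolerance are illustrative choices, not values from the paper.

```python
def find_threshold(pooled_response, mask, d=1.0, lo=1e-4, hi=1.0, tol=1e-4):
    """Find the target contrast whose Minkowski-pooled response difference
    (mask+target vs. mask alone) equals the decision criterion d."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pooled_response(mask, mid) < d:  # distortion still below threshold
            lo = mid                        # increase the target contrast
        else:
            hi = mid                        # decrease the target contrast
    return 0.5 * (lo + hi)                  # contrast deemed the final threshold
```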

The Watson and Solomon model is a model of V1 simple-cell responses (after "Divide" in Figure 1). Let z(x_0, y_0, f_0, θ_0) denote the initial linear modeled neural response (log-Gabor filter output) at location (x_0, y_0), center radial frequency f_0, and orientation θ_0. In the model, the response of a neuron (after division) tuned to these parameters is given by

r(x_0, y_0, f_0, \theta_0) = \frac{g \cdot z(x_0, y_0, f_0, \theta_0)^p}{b^q + \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q}, \qquad (1)

[Figure 1 (block diagram): the Mask and the Mask+Target each pass through CSF filtering, log-Gabor decomposition, excitatory and inhibitory nonlinearities, inhibitory pooling, and division; the divided responses are subtracted and Minkowski-pooled, and the pooled response is compared against the decision d. If not yet at threshold, the target contrast is changed; otherwise the final threshold is calculated.]

Figure 1. Flow of the Watson and Solomon contrast gain control model [31]. For a description of the model, please see Section 2.1.

where g is the output gain [22], p provides the point-wise excitatory nonlinearity, q provides the point-wise inhibitory nonlinearity from the neurons in the inhibitory pool, IN indicates the set of neurons which are included in the inhibitory pool, and b represents the semi-saturation constant.
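
A minimal NumPy sketch of the divisive normalization in Equation (1) is given below, assuming the log-Gabor magnitudes z are already computed and stored in an array indexed by space, frequency, and orientation; the Gaussian and limited-bandwidth pooling kernels of Table 1 are replaced here by a simple uniform local sum for brevity.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def gain_control_response(z, g=0.1, p=2.4, q=2.35, b=0.035, pool_size=(3, 3, 3, 3)):
    """Divisive gain control (Eq. 1): r = g * z^p / (b^q + sum_pool(z^q)).

    z : array of log-Gabor magnitudes with axes (y, x, frequency, orientation).
    pool_size : extent of the inhibitory pool along each axis (uniform weights
                here; the paper uses Gaussian/limited-bandwidth kernels).
    """
    excitatory = g * np.power(z, p)
    # Pool z^q over neighboring positions, frequencies, and orientations;
    # uniform_filter gives a local mean, so rescale it to a local sum.
    inhibitory = uniform_filter(np.power(z, q), size=pool_size, mode="nearest")
    inhibitory *= np.prod(pool_size)
    return excitatory / (b ** q + inhibitory)
```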

Watson and Solomon [31] optimized their model parameters using the results of Foley and Boynton's [14] masking experiment with sine-wave gratings and Gabor patterns. However, as mentioned previously, we have recently developed a large dataset of local masking in natural scenes [24]. Here, this dataset was used with a brute-force search to determine the optimum model parameters within biologically plausible ranges [22, 32]. The parameters of the optimized gain control model are given in Table 1.

Table 1. Parameters of the optimized gain control model. Parameters marked with (*) were varied; values of the unmarked parameters were taken from Watson and Solomon's [31] work and Chandler et al.'s [22] work. Detailed descriptions of the functions can be found in our previous work [24].

Contrast sensitivity filter [33, 34]
  f_CSF    Peak frequency                           6.0 c/deg
  BW_CSF   log10 bandwidth                          1.43
Log-Gabor decomposition
  n_f*     Number of frequency bands                6
  BW_f*    Frequency channel bandwidth              2.75 octaves
  f_0G*    Center radial frequencies of the bands   0.3, 0.61, 1.35, 3.22, 7.83, 16.1 c/deg
  n_θ      Number of orientation channels           6
  BW_θ     Orientation channel bandwidth            30°
  θ_0G     Center angles of orientation channels    0°, ±30°, ±60°, 90°
Divisive gain control
  p, q*    Excitatory and inhibitory exponents      2.4, 2.35
  b        Semi-saturation constant                 0.035
  g*       Output gain                              0.1
  s_x,y    Inhibitory pool kernel: space            3×3 Gaussian, zero mean, 0.5 standard deviation [24]
  s_f      Inhibitory pool kernel: frequency        ±0.7 octave bandwidth with equal weights [24]
  s_θ      Inhibitory pool kernel: orientation      ±60° bandwidth with equal weights [24]
  β_r, β_f, β_θ  Minkowski exponents: space, frequency, orientation   2.0, 1.5, 1.5
  d        "At threshold" decision                  1.0

2.2 Structural facilitation (SF) model

With the abovementioned parameter optimizations, the Watson and Solomon gain control model performed fairly well in predicting the detection thresholds from our local masking database; a Pearson correlation coefficient of 0.83 was observed between the ground-truth thresholds and the model predictions. However, the model consistently overestimated the thresholds for (i.e., underestimated the visibilities of) distortions in regions containing recognizable structures. For example, Figure 2 shows how the optimized Watson and Solomon model overestimates the thresholds near the top of the gecko's body and in the child's face.

[Figure 2: For the geckos and child_swimming images, distortion visibility maps from the optimized contrast gain control model are shown next to the experimental distortion visibility maps; the regions of wrong predictions are marked.]

Figure 2. The optimized Watson and Solomon model overestimated detection thresholds for distortions in image regions containing recognizable structures (e.g., near the top of the gecko's body and in the child's face). The distortion visibility maps show the contrast detection thresholds at each spatial location of the images in terms of RMS contrast [35]. The colorbars denote the ranges of distortion visibilities for the corresponding image.

2.2.1 Facilitation via reduced inhibition

Figure 2 suggests that recognizable structures within the local regions of natural scenes facilitate (rather than mask) distortion visibility. This observation also occurred in several other images in our dataset (see Ref. 24 for a description of the observed facilitation). Here, to model this "structural facilitation," we employ an inhibition multiplication factor (λ_s) in the gain control equation:

r(x_0, y_0, f_0, \theta_0) = \frac{g \cdot z(x_0, y_0, f_0, \theta_0)^p}{b^q + \lambda_s \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q}. \qquad (2)

The value of λ_s varies depending on the strength of structure within an image. Figure 3 shows the change of inhibition in V1 simple cells depending on the strength of a structure.

[Figure 3 (plot): inhibition multiplier λ_s versus structure strength; strong structures correspond to lower neuron inhibition, weak structures to higher neuron inhibition.]

Figure 3. The inhibition multiplier λ_s varies depending on structure strength. Strong structures give rise to lower inhibition, facilitating distortion visibility.

Recent cognitive studies [36] have shown active feedback connections to V1 neurons coming from higher-level cortices. Our assumption is that recognition of a structure results in lower inhibition in V1 simple cells via increased feedback from higher levels. For the i-th block of an image, λ_si is calculated via the following sigmoid function:

\lambda_{s_i} =
\begin{cases}
1 - \dfrac{1}{80} \displaystyle\sum_{x,y=1,1}^{M,N} \left[ \dfrac{1}{1 + \exp\!\left( -\dfrac{S_i(x,y) - p(S,80)}{0.005} \right)} \right], & \max(S) > 0.04 \text{ and } \mathrm{Kurt}(S) > 3.5 \\
1, & \text{otherwise},
\end{cases} \qquad (3)

where S denotes a structure map of an image, S_i denotes the i-th block of S, M and N denote the block width and height, and p(S, 80) denotes the 80th percentile of S. The maximum value and kurtosis of the structure map S are used to determine whether there is localized strong structure within the image. The values of the parameters of Equation (3) were determined empirically.
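
A minimal sketch of this computation is shown below, assuming the structure map S and the block S_i are 2-D NumPy arrays, that the sum in Equation (3) is normalized by 80, and adding a clip to the range [0, 1] as a safeguard not stated above.

```python
import numpy as np
from scipy.stats import kurtosis
from scipy.special import expit

def inhibition_multiplier(S, block, thresh_max=0.04, thresh_kurt=3.5):
    """Structural-facilitation multiplier lambda_s for one block (Eq. 3).

    S     : full structure map of the image (2-D array).
    block : the block S_i of the structure map (2-D array, M x N).
    """
    # Apply facilitation only when the image contains localized strong structure.
    if S.max() <= thresh_max or kurtosis(S, axis=None, fisher=False) <= thresh_kurt:
        return 1.0
    # Sharp sigmoid: ~1 for pixels above the global 80th percentile, ~0 below.
    p80 = np.percentile(S, 80)
    sig = expit((block - p80) / 0.005)
    return float(np.clip(1.0 - sig.sum() / 80.0, 0.0, 1.0))
```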

2.2.2 Structure detection

The structure map S of an image is generated via the following equation which uses different feature maps:

S = L_n \times Sh_n \times E_n \times (1 - D_{\mu n})^2 \times (1 - D_{\sigma n})^2. \qquad (4)

L_n denotes a map of local luminance. Sh_n denotes a map of local sharpness [37]. E_n denotes a map of local Shannon (first-order) entropy. D_μn and D_σn denote maps of, respectively, the average and standard deviation of fractal texture features [38] computed for each local region. Each of these features was measured for 32×32-pixel blocks with 50% overlap between neighboring blocks. Each map was then globally normalized so as to bring all values within the range 0 to 1, and then resized using bicubic interpolation to match the dimensions of the input image. Examples of structure maps computed for two images are shown in Figure 4.
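
A minimal sketch of Equation (4) follows, under the assumption that the five feature maps have already been computed and resized to the same dimensions; the luminance, sharpness, entropy, and fractal-texture extractors themselves are outside the scope of this sketch.

```python
import numpy as np

def normalize(m):
    """Globally rescale a feature map to the range [0, 1]."""
    m = m.astype(np.float64)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def structure_map(L, Sh, E, Dmu, Dsigma):
    """Structure map S of Eq. (4) from five same-sized feature maps:
    local luminance L, local sharpness Sh, local entropy E, and the mean
    and standard deviation of local fractal-texture features (Dmu, Dsigma)."""
    L, Sh, E, Dmu, Dsigma = map(normalize, (L, Sh, E, Dmu, Dsigma))
    return L * Sh * E * (1.0 - Dmu) ** 2 * (1.0 - Dsigma) ** 2
```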

[Figure 4: structure maps S for the geckos and child_swimming images, with a colorbar indicating structure strength (approximately 0 to 0.09).]

Figure 4. Structure maps of two example images. The colorbar denotes the structure strength at each spatial location of the structure map.

2.2.3 Performance of the combined (CGC+SF) model

By incorporating the SF model into the optimized CGC model (as specified in Equation (2)), we were able to improve the model's predictions for regions of images containing recognizable structures, while not adversely affecting the predictions for other regions. Figure 5 shows some results of using the combined (CGC+SF) model. Observe that the structural facilitation improved the prediction performance in local regions of images containing recognizable structures. For example, in the geckos image, near the top gecko's skin, the predictions using the CGC+SF model better match the ground-truth thresholds than those using only the CGC model. Furthermore, the Pearson correlation coefficients between the CGC+SF model predictions and the ground-truth thresholds also improved as compared to using the CGC model alone.

3. DISTORTION VISIBILITY PREDICTION MODEL-2: CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL

In terms of accuracy, the CGC+SF model performs quite well at predicting local distortion visibilities. However, this prediction accuracy comes at the cost of heavy computational and memory demands.

[Figure 5: For the geckos, child_swimming, foxy, and couple images, distortion visibility maps from the optimized CGC model, the experiment, and the CGC+SF model are compared. PCC with the experimental map (CGC → CGC+SF): geckos 0.68 → 0.77, child_swimming 0.85 → 0.87, foxy 0.58 → 0.63, couple 0.55 → 0.60.]

Figure 5. Structural facilitation improves the distortion visibility predictions in local regions of images containing recognizable structures. The Pearson correlation coefficient (PCC) of each map with the experiment map is shown below the map.

In particular, the CGC+SF model uses a filterbank consisting of 20 log-Gabor filters yielding complex-valued responses to mimic on-center and off-center neural responses. This filtering is quite expensive in terms of memory requirements, which easily exceed the cache on modern processors [39]. In addition, the computation of the CGC responses is expensive, particularly considering that an iterative search procedure is required for the CGC+SF model to predict the thresholds for each block.

One way to overcome these issues, which is the approach adopted in this paper, is to use a model based on a convolutional neural network (CNN) [40]. The key advantages of a CNN model are two-fold. First, a CNN model does not require the distorted (reconstructed) image as an input; the predictions can be made based solely on the original image. The distorted images are required only when training the CNN, and it is during this stage that the characteristics of particular compression artifacts are learned by the model. The second advantage is that a CNN model does not require a search procedure for its prediction.

3.1 CNN Network architecture

In a recent work, we proposed a CNN architecture (VNet-1) [41] to predict local visibility thresholds for Gabor noise distortions in natural scenes. For VNet-1, the input patch dimension was 85 × 85 pixels (approximately 1.9° of visual angle), and the training data came from our aforementioned masking dataset [24].

In this paper, we propose a modified CNN architecture (VNet-2) for predicting local visibility thresholds for HEVC distortions in individual frames of a video. Figure 6 shows the network architecture. The network contains three layers: C1 (convolution), S2 (subsampling), and F3 (full connection). Each layer contains one feature map. The convolution kernel size of the C1 layer is 19×19. The subsampling factor in layer S2 is 2. The dimensions of the input patch and feature maps of VNet-2 are M = N = 64, d1r = d1c = 46, d2r = d2c = 23.
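
A minimal NumPy sketch of a VNet-2-like forward pass implied by these dimensions is given below; the activation function and the exact subsampling rule are not specified in the text above, so a hyperbolic-tangent activation and 2×2 average pooling are assumed here for illustration. The parameter shapes correspond to the trainable-parameter counts given in Section 3.2.

```python
import numpy as np

def vnet2_forward(patch, k, b1, s_w, s_b, w_full, b3):
    """Forward pass of a VNet-2-like network on a 64x64 luma patch.

    patch      : 64x64 input patch.
    k, b1      : 19x19 convolution kernel and scalar bias (layer C1).
    s_w, s_b   : scalar weight and bias of the subsampling layer S2.
    w_full, b3 : 23x23 weight map and scalar bias of the full connection F3.
    Returns a single distortion-visibility value.
    """
    # C1: valid convolution, 64 - 19 + 1 = 46 -> 46x46 feature map.
    c1 = np.empty((46, 46))
    for i in range(46):
        for j in range(46):
            c1[i, j] = np.sum(patch[i:i + 19, j:j + 19] * k) + b1
    c1 = np.tanh(c1)

    # S2: 2x2 average pooling scaled by one trainable weight plus bias -> 23x23.
    s2 = np.tanh(c1.reshape(23, 2, 23, 2).mean(axis=(1, 3)) * s_w + s_b)

    # F3: full connection of the 23x23 map to a single output.
    return float(np.sum(s2 * w_full) + b3)
```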

[Figure 6: VNet-2 architecture: input patch (M × N) → layer C1 (convolution) feature map (d1r × d1c) → layer S2 (subsampling) feature map (d2r × d2c) → layer F3 (full connection) → output (distortion visibility).]

Figure 6. The network architecture (VNet-2) used to predict the detection thresholds. For a description of the architecture, please see Section 3.1.

3.2 CNN training parameters

VNet-2 contains a total of 894 trainable parameters: the C1 layer contains 362 trainable parameters (19 × 19 kernel + 1 bias), the S2 layer contains 2 trainable parameters (1 weight + 1 bias), and the F3 layer contains 530 trainable parameters (23 × 23 weights + 1 bias).

For training VNet-2, the ground-truth dataset was created using the outputs of the CGC+SF model described in Section 2. Specifically, the CGC+SF model was used to predict a visibility threshold for each 64 × 64 block of the 30 natural scenes from the CSIQ dataset [34], and these predicted thresholds were then used as ground truth for training the CNN model. Clearly, additional prediction error is introduced by using the CGC+SF model to train the CNN model. However, our masking dataset from Ref. 24 contains thresholds for 85 × 85 blocks, whereas thresholds for 64 × 64 blocks are needed for use in HEVC. Our CNN should thus be regarded as a fast approximation of the CGC+SF model.

Only the luma channel of each block was distorted by changing the HEVC quantization parameter QP from 0 to 51. The visibility threshold (contrast detection threshold C_T) and the corresponding quantization parameter threshold (QP_T) were calculated for each of the 1920 patches using the CGC+SF model. Moreover, the number of patches was increased by a factor of four (to a total of 7680 patches) by horizontally and/or vertically flipping the original patches. Flipping a patch would not, in principle, change the threshold for visually detecting distortions within that patch; however, in terms of the network, flipping a patch provides a different input pattern and thus facilitates learning.
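
A minimal sketch of this four-fold augmentation, assuming each patch is a 2-D NumPy array paired with its threshold label:

```python
import numpy as np

def augment_by_flipping(patches, thresholds):
    """Quadruple the training set: original, horizontal flip, vertical flip,
    and both flips, each inheriting the original patch's threshold."""
    aug_patches, aug_thresholds = [], []
    for patch, ct in zip(patches, thresholds):
        for flipped in (patch, np.fliplr(patch), np.flipud(patch),
                        np.flipud(np.fliplr(patch))):
            aug_patches.append(flipped)
            aug_thresholds.append(ct)  # flipping does not change the threshold
    return np.array(aug_patches), np.array(aug_thresholds)
```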

We used 70% of the data for training, 15% for validation, and 15% for testing. The data were randomly selected from each image to avoid image bias. A committee of 10 networks was trained via the Flexible Image Recognition Software Toolbox [42] using the resilient backpropagation algorithm [43]; the results from the 10 networks were averaged to predict the thresholds.

3.3 Prediction performance of CNN model

The CNN model provided reasonable predictions of the thresholds from the CGC+SF model. Figure 7 shows the scatter plots between the CGC+SF thresholds and the CNN predictions. The Pearson correlation coefficients were 0.95 and 0.93 for the training+validation and testing data, respectively. From Figure 7, observe that the scatter plots saturate at both ends of the visibility threshold range. Below −40 dB the patches are very smooth, and above 0 dB the patches are very dark. In both cases, the patches are devoid of visible content, and thus the coding artifacts would be imperceptible if multiple blocks are coded similarly.

3.4 QP prediction using neural network

The CNN model predicts the visibility threshold (detection threshold C_T) for each patch. To use the model in an HEVC encoder, we thus need to predict the HEVC quantization parameter QP such that the resulting distortions exhibit a contrast at or below C_T. Note that the quantization step (Q_step) in HEVC relates to QP via [44]

Q_{step} = \left(2^{1/6}\right)^{QP - 4}. \qquad (5)

[Figure 7: Scatter plots of CNN predictions (dB) versus ground-truth distortion visibilities (outputs of the CGC+SF model, dB): (a) training + validation, PCC 0.95; (b) testing, PCC 0.93.]

Figure 7. Scatter plots between the CGC+SF thresholds and the CNN predictions. Example patches at various levels of distortion visibility are also shown using arrows.

[Figure 8: For three example patches, (a) Q_step and (b) log(Q_step) are plotted against the RMS contrast (C) of the HEVC distortion (dB).]

Figure 8. (a) The quantization step (Q_step) has an exponential relationship with the contrast (C) of the HEVC distortion; (b) log(Q_step) has a quadratic relationship with C.

Figure 8 shows plots of (a) Q_step and (b) log(Q_step) as a function of the contrast (C) of the induced HEVC distortion. Note that Q_step shows an exponential relationship with C, and log(Q_step) shows a quadratic relationship with C. Thus, we use the following equation to predict log(Q_step) from C:

\log(Q_{step}) = \alpha C^2 + \beta C + \gamma, \qquad (6)

where α, β, and γ are model coefficients which change depending on patch features. The coefficients α, β, and γ can be reasonably well modeled using four simple features of a patch: (i) the Michelson luminance contrast, (ii) the RMS luminance contrast, (iii) the standard deviation of luminance, and (iv) the y-axis intercept of a plot of the patch's log-magnitude spectrum. Descriptions of these features are available in the "Analysis" section of our masking dataset [24].
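
Before the predictor networks described next can be trained, the per-patch coefficients themselves have to be obtained by fitting Equation (6). A minimal sketch of that fit follows, assuming, for one patch, arrays of measured distortion contrasts (in dB) and the QP values that produced them; np.polyfit and the natural logarithm are illustrative choices, not taken from the paper.

```python
import numpy as np

def fit_patch_coefficients(contrasts_db, qps):
    """Fit log(Qstep) = alpha*C^2 + beta*C + gamma (Eq. 6) for one patch.

    contrasts_db : measured RMS contrasts (dB) of the HEVC distortion
                   obtained at the quantization parameters in `qps`.
    qps          : the QP values used to generate those distortions.
    """
    # Eq. (5): Qstep = (2^(1/6))^(QP - 4)  =>  log(Qstep) = (QP - 4) * log(2)/6.
    log_qsteps = (np.asarray(qps, dtype=float) - 4.0) * np.log(2.0) / 6.0
    # np.polyfit returns the quadratic coefficients highest-order first.
    alpha, beta, gamma = np.polyfit(contrasts_db, log_qsteps, deg=2)
    return alpha, beta, gamma
```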

To predict α, β, and γ from the patch features, we designed three separate committees of neural networks. Each committee had a total of five two-layered feed-forward networks, and each network contained 10 neurons. Each of the networks was trained using a scaled conjugate gradient algorithm with 70% training, 15% validation, and 15% test data.

Figure 9 shows the scatter plots for α, β, and γ. The x-axis shows the values of α, β, and γ found by fitting Equation (6), and the y-axis shows the α, β, and γ predictions from the feed-forward networks.

[Figure 9: Scatter plots of (a) α (PCC: 0.96), (b) β (PCC: 0.97), and (c) γ (PCC: 0.97); the fitted values from the quadratic equation log(Q_step) = αC² + βC + γ are on the x-axis, and the predictions from patch features are on the y-axis.]

Figure 9. Scatter plots of (a) α, (b) β, and (c) γ. The x-axis shows α, β, and γ found by fitting Equation (6); the y-axis shows the α, β, and γ predictions from the feed-forward networks.

Observe that the patch features predict α, β, and γ reasonably well, yielding Pearson correlation coefficients of 0.96, 0.97, and 0.97, respectively.

The prediction process for the quantization parameter QP of an input image patch is thus summarized as follows (a code sketch of the full procedure is given at the end of this subsection):

• First, predict the distortion visibility (C_T) using the committee of VNet-2 networks.

• Second, calculate four features of the patch: (i) Michelson contrast, (ii) RMS contrast, (iii) standard deviation of patch luminance, and (iv) the y-axis intercept of the log-magnitude spectrum.

• Third, using the four features as inputs, predict α, β, and γ from the committees of feed-forward networks.

• Fourth, calculate log(Q_step) using Equation (6).

• Finally, calculate QP using Equation (7):

QP = \max\!\left( \min\!\left( \mathrm{round}\!\left[ \frac{\log(Q_{step})}{\log(2^{1/6})} + 4 \right],\ 51 \right),\ 0 \right). \qquad (7)

Furthermore, we have noticed that when C_T < −40 dB the patches are very smooth, and thus they can be coded with a higher QP. Thus, for both the CGC+SF model and the CNN model, QP was set to 30 for patches with C_T < −40 dB.
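
A minimal end-to-end sketch of this procedure is given below; the functions predict_ct, patch_features, predict_alpha, predict_beta, and predict_gamma are placeholders for the trained committees and feature computations described above, not part of any released code, and the natural logarithm is assumed.

```python
import numpy as np

def predict_qp(patch, predict_ct, patch_features,
               predict_alpha, predict_beta, predict_gamma):
    """Map a 64x64 patch to its threshold quantization parameter QP."""
    ct = predict_ct(patch)            # Step 1: distortion visibility (dB)
    if ct < -40.0:                    # very smooth patch: fixed coarse QP
        return 30
    feats = patch_features(patch)     # Step 2: four simple patch features
    alpha = predict_alpha(feats)      # Step 3: quadratic coefficients
    beta = predict_beta(feats)
    gamma = predict_gamma(feats)
    log_qstep = alpha * ct**2 + beta * ct + gamma            # Step 4: Eq. (6)
    qp = round(log_qstep / np.log(2.0 ** (1.0 / 6.0)) + 4)   # Step 5: Eq. (7)
    return int(max(min(qp, 51), 0))
```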

4. RESULTS

In this section we briefly present our preliminary results.

4.1 QP maps

Figure 10 shows the QP maps for two 512 × 512 images. The QP maps are generated from the CGC+SF and CNN models. Note that the QP maps show how the bits get allocated during the encoding process. For example, the dark regions in the redwood tree (in the redwood image) and the bush near the wood log (in the log_seaside image) show higher QP values, denoting that those regions will be coded more coarsely than other regions.

[Figure 10: QP maps from the CGC+SF model and from the CNN model for the redwood and log_seaside images; the colorbar spans QP values from 0 to 50.]

Figure 10. The quantization parameter (QP) maps for two images, generated from the CGC+SF and CNN models.

4.2 Compression performance

Both the CGC+SF and CNN models take a 64 × 64-pixel image patch as input and predict the distortion visibility (C_T) and the corresponding threshold QP. As a preliminary test, we have used the QP maps to evaluate the compression performance using sample images from the CSIQ dataset [34].

While coding an HEVC coding tree unit, we searched for the minimum QP around the spatial location of that unit within the image and assigned that minimum QP to the unit. Figure 11 shows example images illustrating the compression performance. Note that coding using the QP map essentially adds only near- or below-threshold distortions, and thus judging the quality of the images is quite difficult. However, to compare against the reference HEVC, a visual matching experiment was performed by two experienced subjects. The purpose of the experiment was to find the QP at which the image coded by the reference software matches the quality of the image coded using the CGC+SF model. After the visually equivalent reference-coded image was found, the corresponding bits-per-pixel (bpp) was recorded. In Figure 11, the first column shows the reference images. The second column shows the reference-software-coded images which are visually equivalent to the images coded using the CGC+SF QP map. The third column shows the images coded using the CGC+SF QP map, and the fourth column shows the images coded using the CNN QP map.
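
The per-unit QP assignment described at the start of this paragraph can be sketched as follows, assuming a per-block QP map at 64 × 64 granularity; the exact neighborhood size used for the minimum-QP search is not stated above, so a 3 × 3 block neighborhood is assumed here.

```python
import numpy as np

def ctu_qp(qp_map, block_row, block_col, radius=1):
    """QP for the coding tree unit at (block_row, block_col): the minimum QP
    found in the surrounding neighborhood of the 64x64-block QP map."""
    r0 = max(block_row - radius, 0)
    r1 = min(block_row + radius + 1, qp_map.shape[0])
    c0 = max(block_col - radius, 0)
    c1 = min(block_col + radius + 1, qp_map.shape[1])
    return int(qp_map[r0:r1, c0:c1].min())
```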

Although subjective experiments are more reliable for assessing quality at near/below-threshold distortion levels, we have reported the structural similarity index (SSIM) [45] between the reference images and the coded images. Note that the qualities of the images are very close. The bpp values and compression gains are also shown below the images. On average, using the QP maps from the CGC+SF and CNN models yielded 15% and 11% gains, respectively. It should be noted that we generated the QP maps mainly from contrast masking and structural facilitation. Thus, if an image does not contain enough local regions to mask the distortions, using the QP map yields no gain.

4.3 Execution-time of QP map prediction

One of the reasons for introducing a CNN model after the CGC+SF model was that the CGC+SF model is too slow to be considered for real-time applications. The execution times of the two models were measured on a computer with an Intel(R) Core(TM)2 Quad CPU Q9400 2.67 GHz processor, 12 GB of RAM, and an NVIDIA Quadro K5000 graphics card, running Matlab 2014a. The execution times were measured for generating QP maps for 512 × 512-pixel images with 64 × 64 blocks. The execution times for the CGC+SF model and the CNN model were 239.83 ± 30.564 sec and 2.24 ± 0.29 sec, respectively. Thus, on average, the CNN model was 106 times faster than the CGC+SF model.

[Figure 11 (image comparison): each row shows a reference image, the visually equivalent reference-HEVC-coded image, the image coded using the CGC+SF QP map, and the image coded using the CNN QP map; SSIM, bpp, and compression gain are reported below the coded images:
  Visually equivalent: SSIM 0.96, bpp 2.43;  CGC+SF: SSIM 0.94, bpp 1.69, gain 18.3%;  CNN: SSIM 0.92, bpp 1.35, gain 35.2%
  Visually equivalent: SSIM 0.89, bpp 2.13;  CGC+SF: SSIM 0.88, bpp 1.82, gain 14.9%;  CNN: SSIM 0.88, bpp 2.12, gain 0.5%
  Visually equivalent: SSIM 0.94, bpp 3.01;  CGC+SF: SSIM 0.94, bpp 2.60, gain 13.7%;  CNN: SSIM 0.94, bpp 2.58, gain 14.3%
  Visually equivalent: SSIM 0.96, bpp 2.49;  CGC+SF: SSIM 0.94, bpp 2.25, gain 9.6%;  CNN: SSIM 0.94, bpp 2.39, gain 3.9%
  Visually equivalent: SSIM 0.82, bpp 2.12;  CGC+SF: SSIM 0.82, bpp 1.69, gain 20.2%;  CNN: SSIM 0.84, bpp 2.03, gain 4.3%]

Figure 11. Compression performance using the QP maps generated from the CGC+SF and CNN models.

5. CONCLUSIONS AND FUTURE WORK

This paper described a fast convolutional-neural-network-based quantization strategy for HEVC. Local artifact visibility was predicted via a network trained on data derived from an improved contrast gain control model. The contrast gain control model was trained on our recent database of local distortion visibility in natural scenes [24].

Furthermore, to incorporate the effects of recognizable structures on distortion visibility, a structural facilitation model was proposed and incorporated into the contrast gain control model. Our results provide, on average, an 11% improvement in compression efficiency for HEVC using the QP map derived from the CNN model, while requiring almost one hundredth of the computational time of the CGC+SF model. Future avenues of our work include computational-time analysis for predicting distortion visibility in real-time video.

REFERENCES

[1] “Recommendation and final draft international standard of joint video specification (ITU-T Rec. H.265 | ISO/IEC 23008-2 HEVC),” (Nov. 2013).

[2] Lubin, J., “Sarnoff JND vision model: Algorithm description and testing,” Presentation by Jeffrey Lubin of Sarnoff to T1A1.5 (1997).

[3] Winkler, S., “Vision models and quality metrics for image processing applications,” Thèse de doctorat, École Polytechnique Fédérale de Lausanne (2000).

[4] Watson, A. B., Hu, J., and McGowan, J. F., “Digital video quality metric based on human vision,” Journal of Electronic Imaging 10(1), 20–29 (2001).

[5] Watson, A. B., “DCTune: A technique for visual optimization of DCT quantization matrices for individual images,” in [SID International Symposium Digest of Technical Papers], 24, 946–949, Society for Information Display (1993).

[6] Chou, C.-H. and Li, Y.-C., “A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile,” Circuits and Systems for Video Technology, IEEE Transactions on 5(6), 467–476 (1995).

[7] Hontsch, I. and Karam, L. J., “Adaptive image coding with perceptual distortion control,” Image Processing, IEEE Transactions on 11(3), 213–222 (2002).

[8] Chou, C.-H., Chen, C.-W., et al., “A perceptually optimized 3-D subband codec for video communication over wireless channels,” Circuits and Systems for Video Technology, IEEE Transactions on 6(2), 143–156 (1996).

[9] Peterson, H. A., Peng, H., Morgan, J., and Pennebaker, W. B., “Quantization of color image components in the DCT domain,” in [Electronic Imaging ’91, San Jose, CA], 210–222, International Society for Optics and Photonics (1991).

[10] Ahumada Jr., A. J. and Peterson, H. A., “Luminance-model-based DCT quantization for color image compression,” in [SPIE/IS&T 1992 Symposium on Electronic Imaging: Science and Technology], 365–374, International Society for Optics and Photonics (1992).

[11] Zhang, X., Lin, W., and Xue, P., “Improved estimation for just-noticeable visual distortion,” Signal Processing 85(4), 795–808 (2005).

[12] Jia, Y., Lin, W., Kassim, A., et al., “Estimating just-noticeable distortion for video,” Circuits and Systems for Video Technology, IEEE Transactions on 16(7), 820–829 (2006).

[13] Naccari, M. and Pereira, F., “Advanced H.264/AVC-based perceptual video coding: Architecture, tools, and assessment,” Circuits and Systems for Video Technology, IEEE Transactions on 21(6), 766–782 (2011).

[14] Foley, J. M. and Boynton, G. M., “New model of human luminance pattern vision mechanisms: analysis of the effects of pattern orientation, spatial phase, and temporal frequency,” in [Proceedings of SPIE], 2054, 32 (1994).

[15] Naccari, M. and Mrak, M., “Intensity dependent spatial quantization with application in HEVC,” in [Multimedia and Expo (ICME), 2013 IEEE International Conference on], 1–6, IEEE (2013).

[16] Blasi, S. G., Mrak, M., and Izquierdo, E., “Masking of transformed intra-predicted blocks for high quality image and video coding,” in [Image Processing (ICIP), 2014 IEEE International Conference on], 3160–3164, IEEE (2014).

[17] Yu, Y., Zhou, Y., Wang, H., Tu, Q., and Men, A., “A perceptual quality metric based rate-quality optimization of H.265/HEVC,” in [Wireless Personal Multimedia Communications (WPMC), 2014 International Symposium on], 12–17, IEEE (2014).

[18] Li, Y., Liao, W., Huang, J., He, D., and Chen, Z., “Saliency based perceptual HEVC,” in [Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on], 1–5, IEEE (2014).

[19] Caelli, T. and Moraglia, G., “On the detection of signals embedded in natural scenes,” Perception and Psychophysics 39, 87–95 (1986).

[20] Nadenau, M. J., Reichel, J., and Kunt, M., “Performance comparison of masking models based on a new psychovisual test method with natural scenery stimuli,” Signal Processing: Image Communication 17(10), 807–823 (2002).

[21] Watson, A. B., Borthwick, R., and Taylor, M., “Image quality and entropy masking,” Proceedings of SPIE 3016 (1997).

[22] Chandler, D. M., Gaubatz, M. D., and Hemami, S. S., “A patch-based structural masking model with an application to compression,” J. Image Video Process. 2009, 1–22 (2009).

[23] Chandler, D. M., Alam, M. M., and Phan, T. D., “Seven challenges for image quality research,” in [IS&T/SPIE Electronic Imaging], 901402, International Society for Optics and Photonics (2014).

[24] Alam, M. M., Vilankar, K. P., Field, D. J., and Chandler, D. M., “Local masking in natural images: A database and analysis,” Journal of Vision 14(8), 22 (2014). Database available at: http://vision.okstate.edu/masking/.

[25] Alam, M. M., Vilankar, K. P., and Chandler, D. M., “A database of local masking thresholds in natural images,” in [IS&T/SPIE Electronic Imaging], 86510G, International Society for Optics and Photonics (2013).

[26] Legge, G. E. and Foley, J. M., “Contrast masking in human vision,” J. of Opt. Soc. Am. 70, 1458–1470 (1980).

[27] Wei, Z. and Ngan, K. N., “Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain,” Circuits and Systems for Video Technology, IEEE Transactions on 19(3), 337–346 (2009).

[28] Teo, P. C. and Heeger, D. J., “Perceptual image distortion,” Proceedings of SPIE 2179, 127–141 (1994).

[29] Foley, J. M., “Human luminance pattern-vision mechanisms: Masking experiments require a new model,” J. of Opt. Soc. Am. A 11(6), 1710–1719 (1994).

[30] Lubin, J., “A visual discrimination model for imaging system design and evaluation,” in [Vision Models for Target Detection and Recognition], Peli, E., ed., 245–283, World Scientific (1995).

[31] Watson, A. B. and Solomon, J. A., “A model of visual contrast gain control and pattern masking,” J. of Opt. Soc. Am. A 14(9), 2379–2391 (1997).

[32] DeValois, R. L. and DeValois, K. K., [Spatial Vision], Oxford University Press (1990).

[33] Daly, S. J., “Visible differences predictor: an algorithm for the assessment of image fidelity,” in [Digital Images and Human Vision], Watson, A. B., ed., 179–206 (1993).

[34] Larson, E. C. and Chandler, D. M., “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging 19(1), 011006 (2010).

[35] Peli, E., “Contrast in complex images,” J. of Opt. Soc. Am. A 7, 2032–2040 (1990).

[36] Juan, C.-H. and Walsh, V., “Feedback to V1: a reverse hierarchy in vision,” Experimental Brain Research 150(2), 259–263 (2003).

[37] Vu, C. T., Phan, T. D., and Chandler, D. M., “S3: A spectral and spatial measure of local perceived sharpness in natural images,” IEEE Transactions on Image Processing 21(3), 934–945 (2012).

[38] Costa, A. F., Humpire-Mamani, G., and Traina, A. J. M., “An efficient algorithm for fractal analysis of textures,” in [Graphics, Patterns and Images (SIBGRAPI), 2012 25th SIBGRAPI Conference on], 39–46, IEEE (2012).

[39] Phan, T. D., Shah, S. K., Chandler, D. M., and Sohoni, S., “Microarchitectural analysis of image quality assessment algorithms,” Journal of Electronic Imaging 23(1), 013030 (2014).

[40] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE 86(11), 2278–2324 (1998).

[41] Alam, M. M., Patil, P., Hagan, M. T., and Chandler, D. M., “A computational model for predicting local distortion visibility via convolutional neural network trained on natural scenes,” in [IEEE International Conference on Image Processing (ICIP)], accepted for presentation, IEEE (2015).

[42] Patil, P., “Flexible image recognition software toolbox (FIRST),” MS Thesis, Oklahoma State University, Stillwater, OK (July 2013).

[43] Riedmiller, M. and Braun, H., “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in [Neural Networks, 1993, IEEE International Conference on], 586–591, IEEE (1993).

[44] Sze, V., Budagavi, M., and Sullivan, G. J., [High Efficiency Video Coding (HEVC)], Springer (2014).

[45] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., “Image quality assessment: from error visibility to structural similarity,” Image Processing, IEEE Transactions on 13(4), 600–612 (2004).
