
Learning Receptive Fields and Quality Lookups for Blind Quality Assessment of Stereoscopic Images

Feng Shao, Weisi Lin, Senior Member, IEEE, Shanshan Wang, Gangyi Jiang, Associate Member, IEEE, Mei Yu, and Qionghai Dai, Senior Member, IEEE

Abstract—Blind quality assessment of 3D images encounters more new challenges than its 2D counterpart. In this paper, we propose a blind quality assessment method for stereoscopic images that learns the characteristics of receptive fields (RFs) from the perspective of dictionary learning and constructs quality lookups to replace human opinion scores without performance loss. The important feature of the proposed method is that we do not need a large set of samples of distorted stereoscopic images and the corresponding human opinion scores to learn a regression model. To be more specific, in the training phase, we learn local RFs (LRFs) and global RFs (GRFs) from the reference and distorted stereoscopic images, respectively, and construct their corresponding local quality lookups (LQLs) and global quality lookups (GQLs). In the testing phase, blind quality pooling can be easily achieved by searching the optimal LRF and GRF indexes from the learnt LQLs and GQLs, and the quality score is obtained by combining the LRF and GRF indexes together. Experimental results on three publicly available 3D image quality assessment databases demonstrate that, in comparison with the existing methods, the devised algorithm achieves highly consistent alignment with subjective assessment.

Index Terms—Blind image quality assessment, quality lookup, receptive field (RF), sparse coding, stereoscopic image.

I. INTRODUCTION

WITH the development of digital imaging and display devices, 3D media has become important for information representation. Similar to traditional 2D media, 3D media may be degraded at various stages, and these degradations may lead to loss of important visual information, poor 3D quality of experience, and difficulties for subsequent processing and analysis. Therefore, the problem of 3D image quality assessment (3D-IQA) plays a significant role in various image/video processing and computer vision applications [1], [2].

Manuscript received August 8, 2014; revised November 14, 2014, January 14, 2015, and March 13, 2015; accepted March 16, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 61271021, Grant 61271270, and Grant U1301257, and in part by the K. C. Wong Magna Fund in Ningbo University. This paper was recommended by Associate Editor Y. Zhao.

F. Shao, S. Wang, G. Jiang, and M. Yu are with the Faculty of Information Science and Engineering, Ningbo University, Ningbo 315211, China (e-mail: [email protected]).

W. Lin is with the Centre for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore 639798.

Q. Dai is with the Broadband Networks and Digital Media Laboratory, Tsinghua University, Beijing 100084, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2015.2414479


Currently, quality assessment for 2D image/video has been extensively studied, and many criteria and tools have been proposed [3]–[7]. Current state-of-the-art blind image quality assessment (BIQA) methods rely on samples of distorted images and corresponding human opinion scores to learn a regression/neural-network model that maps image features to quality scores. Many BIQA methods have been proposed, such as blind image quality indices [8], distortion identification-based image verity and integrity evaluation (DIIVINE) [9], the blind/referenceless image spatial quality evaluator (BRISQUE) [10], the blind image integrity notator using DCT statistics-II (BLIINDS-II) [11], and multiple kernel learning based on natural scene statistics [12]. Recently, deep learning techniques have been successfully applied in BIQA by training deep multilayered neural networks [13]–[15]. However, these methods need a large number of training images to learn a reliable regression model, and obtaining human opinion scores is not always easy. Recently, BIQA without human opinion scores for training has attracted much attention. Mittal et al. [16] applied probabilistic latent semantic analysis to quality-aware visual words extracted from a large set of pristine and distorted images; the uncovered latent quality factors are then used to infer the quality of the test image. Xue et al. [17] proposed a quality-aware clustering method to learn a set of quality-aware centroids and use them as the codebook to infer the quality of an image patch. The final quality score for the test image is the weighted average of the patch-level quality scores. However, the performance of these methods is inferior to state-of-the-art regression-based BIQA models.

For 3D-IQA, various factors, such as 2D image quality, depth perception [18], visual comfort [19], and others [20], should be addressed. A number of full-reference (FR) quality assessment approaches for stereoscopic images have been proposed. One straightforward way is to apply the existing 2D metrics to predict the quality of stereoscopic images directly [21]–[25]. Furthermore, 3D/binocular perceptual properties have been taken into consideration [26]–[30]. A detailed description of these approaches has been surveyed in [31]; readers can refer to our previous paper for details. Only recently have some methods for 3D quality assessment been proposed. De Silva et al. [32] proposed an FR stereoscopic video quality metric for symmetrically and asymmetrically compressed stereoscopic video by



measuring structural distortions, blurring artifacts, and content complexity. Chen et al. [33] generated an intermediate "cyclopean" view from the stereoscopic image pair and disparity information, and built a model to account for the influence of binocular rivalry. Lin and Wu [34] incorporated binocular integration behaviors into the existing 2D models to enhance their ability in evaluating stereoscopic 3D images. Jin et al. [35] proposed a 3D image quality model by adopting three components: the cyclopean view, binocular rivalry, and the scene geometry.

Only a few works have been conducted on no-reference/blind 3D-IQA. Ryu and Sohn [36] computed perceptual blurriness and blockiness scores of the left and right images independently, and combined them into an overall quality index by modeling binocular quality perception in the context of blurriness and blockiness. Chen et al. [37] extracted both 2D and 3D features by natural scene statistics from stereoscopic images and the corresponding disparity map, and adopted support vector regression (SVR) to learn a regression function to predict the quality of a test stereoscopic image pair. Gu et al. [38], [39] proposed a blind stereoscopic IQA metric by extracting 3D factors of a nonlinear additive model, an ocular dominance model, and saliency-based parallax compensation. Sazzad et al. [40] proposed a BIQA method for JPEG compressed stereoscopic images based on local features of distortions and disparity. However, the existing BIQA for stereoscopic images still has the following problems: 1) for learning-based approaches, human subjective scores for stereoscopic images are not easily acquired and 2) the evaluation can be distortion aware but should not be limited to any specific types of distortion.

From a neuro-biological point of view, the goal of IQA is to simulate the properties of visual perception [41], [42]. As an important part of the human vision system (HVS), the visual cortex is responsible for most of our visual perception. In the primary visual cortex (V1), simple and complex receptive fields (RFs) are studied to understand the behavior of visual perception or to model its properties [43]. It has been suggested that sparse coding can offer quantitative predictions in line with the measurements from the visual cortex [44]. Many sparse coding methods have been designed for IQA purposes. Zhang et al. [45] used independent subspace analysis to simulate the simple and complex cells' responses. Chang et al. [46] adopted independent component analysis (ICA) to accomplish the sparse coding process, and measured sparse feature fidelity as the quality index. Guha and Ward [47] used sparse modeling to learn the inherent structures, and estimated the perceptual quality by comparing the structures of the reference and the distorted images in terms of the learnt basis vectors. Hunt et al. [48] applied various sparse coding models to binocular RF development across six abnormal rearing conditions. However, there is still little work on how these sparse coding models can be applied to evaluate the perceived quality of stereoscopic images, especially for BIQA.

In this paper, we propose a blind quality assessment method for stereoscopic images by learning RFs and quality lookups. The proposed method has the following unique features.

1) In the training stage, we learn local dictionaries from the reference and distorted stereoscopic image patches to construct local RFs (LRFs), and establish their corresponding local quality lookups (LQLs) by measuring the amplitude and phase differences between the LRFs. We use combinations of LRFs and LQLs to infer the local properties of the training samples.

2) To account for the properties of binocular complex cells, we learn global dictionaries from the binocular energy responses of the reference and distorted stereoscopic images to construct global RFs (GRFs), and establish the corresponding global quality lookups (GQLs) by measuring the gradient similarities between the binocular energy responses. We use combinations of GRFs and GQLs to infer the global properties of the training samples.

3) In the testing stage, blind quality pooling for a test stereoscopic image pair can be easily achieved by searching the optimal LRF and GRF indexes from the learnt LQLs and GQLs, respectively, and the final quality score is obtained by combining the LRF and GRF indexes together.

4) We design two training databases (i.e., symmetric and asymmetric training databases) to reveal whether the selection of different training databases affects the handling of binocular rivalry in the test stage, and find that a training database composed of symmetrically distorted stereoscopic images is a good choice for asymmetric degradations.

The rest of this paper is organized as follows. Section II analyzes the relevant background about the RFs. Section III presents the proposed blind quality assessment approach. The experimental results are given and discussed in Section IV, and finally the conclusion is drawn in Section V.

II. SIMULATION OF THE RFS

There have been significant advances in the understanding of the roles of V1 for binocular vision, and the full neural processing in binocular vision involves visual pathways in various visual areas (i.e., V1–V5) [49]. Since the understanding of V2–V5 is beyond the scope of this paper, we do not describe them in this section. For 3D-IQA, the new challenges mainly come from the understanding of RFs, and we briefly review some RF models related to this paper in the following.

It is known that binocular disparity is the position difference between the retinal images in the left and right eyes, which serves as one of the important cues for depth perception. The basic neural properties for disparity encoding are the two RFs in V1 and their positions. The disparity energy model (also defined as the binocular energy model in some literature) explains the response properties of the binocular complex cell in V1 [50]. As shown in Fig. 1, in the simple cell stage, each simple cell in the left and right eyes is generated by simulating spatial RFs (e.g., 2D Gabor functions). The responses of


Fig. 1. Energy model of a binocular complex cell. Individual complex cells (C) consist of two simple cells (S) and combine across SF, orientation, and disparity.

simple cells are summed up and squared with positive and negative phase differences. Then, in the complex cell stage, responses are combined across spatial frequency (SF), orientation, and disparity. For the neural coding of binocular disparity, some position- and phase-shift-based energy models were proposed in [51] and [52]. An important issue for the understanding of stereoscopic vision is to consider these complex cells to encode disparity, different from the existing algorithms based on matching properties from the monocular left and right images. However, distortions (especially asymmetric degradations) introduce the binocular rivalry phenomenon, which requires the disparity information to fuse/suppress the RF responses from both eyes [53]. In particular, for BIQA without any benchmark, how to estimate image quality based on the strength of the response is especially challenging.

To obtain a monocular RF, previous studies have shown that the RF properties of simple cells in V1 can be represented via sparse coding [54]. The sparse coding model encodes an image patch I(x, y) with N pixels as approximately a linear superposition of M basis vectors {φ_i(x, y)}

$$I(x, y) \approx \sum_{i=1}^{M} a_i \phi_i(x, y) \tag{1}$$

where the sparse feature {a_i} represents the population activity for a neuron.
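As a minimal numerical sketch of (1) (all names here are hypothetical, not taken from the paper's implementation), a vectorized 8 × 8 patch is approximated by a superposition of basis vectors weighted by the sparse features:

```python
import numpy as np

# Minimal sketch of (1): a 64-pixel patch as a superposition of M basis vectors.
M, N = 8, 64
rng = np.random.default_rng(0)
phi = rng.standard_normal((N, M))   # columns play the role of the basis phi_i
a = rng.standard_normal(M)          # sparse features a_i (population activity)
patch = phi @ a                     # I(x, y) ~ sum_i a_i * phi_i(x, y)
patch_2d = patch.reshape(8, 8)      # back to the 8 x 8 spatial layout
```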

The learnt basis vectors {φ_i(x, y)} by sparse coding resemble the RF properties of simple cells in V1 [54]. The purpose of sparse coding is to learn dictionaries from images to simulate the RFs. However, how to use the sparse coding model to learn complex cells is still an unresolved problem [55], because the topography structure and network architecture of complex cells are usually undetermined and hard to quantify (to be resolved in cognitive psychology research). Thus, deriving binocular RFs from monocular RFs is a feasible way to understand the neural mechanisms of binocular processing. In [56], in order to account for V1 complex cells, with the RFs w_left and w_right from each eye, binocularity (ocular dominance) was quantified by

$$b = \frac{\|w_{\mathrm{left}}\| - \|w_{\mathrm{right}}\|}{\|w_{\mathrm{left}}\| + \|w_{\mathrm{right}}\|}. \tag{2}$$

Here, b characterizes the RF properties from monocular to binocular response. A large absolute value of b corresponds to highly monocular responses, while a small absolute value corresponds to binocular responses. Since the sparse feature can reflect the population activity of a neuron, binocular RFs can be derived from the monocular RFs based on their sparse features (to reflect the monocularity) [57]. Fig. 2 shows the monocular RFs of stereoscopic images under different distortions. It is obvious that the structure and distribution information of these RFs are changed by different distortion types and strengths, and the ocular dominance between the left and right eyes is simultaneously affected (producing binocular rivalry in the case of distortions). Therefore, it is necessary to address the sparsity of monocular RFs in measuring the difference between two images (patches) (this motivated us to consider ocular dominance as the basis of binocular rivalry).

Fig. 2. Example of RFs over both eyes. (a) Reference, (b) JPEG, (c) JPEG 2000, (d) Gaussian blur, and (e) white noise, showing the changes in RF structure and distribution induced by different distortions.
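The ocular-dominance measure in (2) is straightforward to compute; a minimal sketch, assuming w_left and w_right are the vectorized RF weights of one neuron for each eye:

```python
import numpy as np

def binocularity(w_left: np.ndarray, w_right: np.ndarray) -> float:
    """Ocular dominance b per (2): |b| near 1 indicates a highly monocular RF,
    |b| near 0 indicates a binocular RF."""
    n_left = np.linalg.norm(w_left)
    n_right = np.linalg.norm(w_right)
    return (n_left - n_right) / (n_left + n_right)
```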

Since binocular perception occurs in the visual cortex, we can briefly describe the process of forming binocular RFs as: 1) obtaining simple and complex cell responses for each eye; 2) learning RF properties by sparse coding; and 3) forming the advanced visual representations from V1 to V5. Since the visual cortex of the brain is very complex and is not well understood yet, in this paper (as described in the next section), we try to simulate the process in quality assessment instead of modeling it.

III. PROPOSED BLIND QUALITY ASSESSMENT METHOD FOR STEREOSCOPIC IMAGES

The high-level diagram of the proposed blind quality assessment framework is given in Fig. 3. The process is composed of training and testing phases. In the training phase, for a selected training dataset, by implementing sparse coding for local image patches and for feature vectors from global binocular energy responses, we learn LRFs and GRFs, respectively, and construct their corresponding LQLs and GQLs.


Fig. 3. Proposed blind quality assessment framework for stereoscopic images.

In the testing phase, for an arbitrary test stereoscopic image pair, the LRF index is computed with respect to the learnt LRFs and LQLs, and the GRF index is computed with respect to the learnt GRFs and GQLs. The essence of the proposed blind quality pooling is to interpolate optimal quality values from the quality lookups based on the RFs. Finally, the final quality score is obtained by combining the LRF and GRF indexes together. Therefore, how to build the LRFs, GRFs, LQLs, and GQLs in the training phase, and how to design the LRF and GRF indexes in the testing phase, are the challenges for the success of the approach.

A. Learning LRFs

Generally, the goal of sparse coding is to find the optimal basis vectors so that an image can be represented by its sparse features. Recently, many sparse coding models have been proposed to simulate the RFs of simple cells in V1 [45]–[47]. In this paper, we use FastICA [58] to learn a set of sparse basis vectors that form the sparse representation of image patches, because ICA is fundamental and well suited to fit the RF properties in the cortex [59]. Since the FastICA algorithm is well known, we do not describe it further here; for the details of the algorithm, please refer to [58].

In the implementation, 18 000 overlapped patches of size 8 × 8 are randomly taken from each training image to learn a set of 64-dimensional sparse basis vectors (only the luminance component is involved). Before applying ICA, we first use principal component analysis (PCA) to reduce the dimension of the sample vectors, and only retain the first eight principal components of the samples for training (the design and parameter selection of PCA is based on [46]). Thus, the RF for the jth training image is represented in the form of a 64 × 8 matrix (each column of the matrix corresponds to one RF unit), denoted as R_j. Here, the sparse basis vectors learnt by ICA serve as RFs of V1 neurons. For the ith sparse basis vector in the jth training image, the corresponding RF is denoted as r_{i,j}, with R_j = {r_{i,j} | 1 ≤ i ≤ 8}.
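A sketch of this LRF learning step under stated assumptions: `patches` holds vectorized 8 × 8 luminance patches, and scikit-learn's FastICA (which performs PCA-based whitening internally) stands in for the FastICA implementation of [58]:

```python
import numpy as np
from sklearn.decomposition import FastICA

def learn_lrf(patches: np.ndarray, n_units: int = 8) -> np.ndarray:
    """Learn a 64 x 8 LRF matrix R_j from an (n_patches, 64) array of
    luminance patches; each column of the result is one RF unit."""
    patches = patches - patches.mean(axis=0)           # zero-mean the samples
    ica = FastICA(n_components=n_units, random_state=0)
    ica.fit(patches)                                   # components_: (8, 64)
    return ica.components_.T                           # R_j: (64, 8)
```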

However, if the training image is degraded, the learnt RFs are degraded simultaneously (as shown in Fig. 2). As discussed in [31], phase and amplitude make different contributions in determining image quality. The RFs demonstrate certain amplitude and phase differences, while V1 can provide suitable substrates for amplitude–phase encoding [60]. In order to measure the perceptual quality for each neural unit (e.g., RF), the amplitude and phase differences between the reference and distorted RFs are calculated by

$$m_{i,j} = \left\| r^{\mathrm{ref}}_{i,j} \right\|_2 - \left\| r^{\mathrm{dis}}_{i,j} \right\|_2 \tag{3}$$

$$p_{i,j} = \arccos\left( \frac{\left\langle r^{\mathrm{ref}}_{i,j},\, r^{\mathrm{dis}}_{i,j} \right\rangle}{\left\| r^{\mathrm{ref}}_{i,j} \right\|_2 \cdot \left\| r^{\mathrm{dis}}_{i,j} \right\|_2} \right) \tag{4}$$

where r^ref_{i,j} and r^dis_{i,j} are the pair of RFs corresponding to the reference and distorted images, respectively, ‖·‖_2 is the l2 norm, and ⟨·,·⟩ calculates the inner product.

With the calculated m_{i,j} and p_{i,j}, we estimate the perceptual quality by [45]

$$q_{i,j} = \frac{1}{1 + a_1 \cdot \left( |m_{i,j}|^{C_1} \cdot (p_{i,j})^{C_2} \right)^{b_1}}. \tag{5}$$

Here, parameters a_1 and b_1 control the curve shape, and C_1 and C_2 control the responses to the amplitude and phase. In the experiment, we set a_1 = 6, b_1 = −2, C_1 = 0.6, and C_2 = 0.5, similar to [45].


Fig. 4. Example of the LRFs from asymmetric training database.

The corresponding quality lookup for the jth training image is denoted as q_j = {q_{i,j} | 1 ≤ i ≤ 8}.
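A sketch of the LQL construction per (3)-(5), assuming `R_ref` and `R_dis` are the 64 × 8 LRF matrices of a reference/distorted training pair (the small epsilon guarding the degenerate case of identical units is an assumption):

```python
import numpy as np

A1, B1, C1, C2 = 6.0, -2.0, 0.6, 0.5   # parameter values given in Section III-A

def lql_entry(r_ref: np.ndarray, r_dis: np.ndarray, eps: float = 1e-12) -> float:
    """Quality of one RF unit: amplitude diff (3), phase diff (4), quality (5)."""
    m = np.linalg.norm(r_ref) - np.linalg.norm(r_dis)                      # (3)
    cos = np.dot(r_ref, r_dis) / (np.linalg.norm(r_ref) * np.linalg.norm(r_dis))
    p = np.arccos(np.clip(cos, -1.0, 1.0))                                 # (4)
    return 1.0 / (1.0 + A1 * (np.abs(m) ** C1 * p ** C2 + eps) ** B1)      # (5)

def build_lql(R_ref: np.ndarray, R_dis: np.ndarray) -> np.ndarray:
    """LQL of one training image: one quality value per RF unit (column)."""
    return np.array([lql_entry(R_ref[:, i], R_dis[:, i])
                     for i in range(R_ref.shape[1])])
```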

Since the RFs above are learnt from image patches (mainly reflecting an image's local properties), we define these RFs as LRFs, and their corresponding quality lookups as LQLs. For all left and right images in the training dataset, we can train four sets of LRFs ({R^{L,ref}_j}, {R^{L,dis}_j}, {R^{R,ref}_j}, and {R^{R,dis}_j}) and their corresponding LQLs ({q^L_j} and {q^R_j}). Fig. 4 shows an example of the constructed LRFs for the left and right images from the asymmetric training database under different distortion types (the asymmetric training database used here will be described in Section IV-B). The LRF for each image is in the form of a 64 × 8 matrix, and each 8 × 8 patch in Fig. 4 corresponds to one RF unit. Thus, all the RFs for each distortion type form six (reflecting distortion strength) sets of 8 × 8 patches. These RFs serve as neurons in visual responses.

B. Learning GRFs

From the results of psychology research [61], the spatial RF of monocular simple cells in V1 can be well modeled by Gabor functions. Binocular complex cells differ from monocular simple cells by adding the stimulus of a phase shift. In this aspect, the two RFs from each eye are located at corresponding points on the two retinas, but have a phase difference between them [62]. Since spatial frequencies in the Gabor functions are dependent on the viewing distance, low-frequency-selective neurons should be encoded with large disparities, and high-frequency-selective neurons should be encoded with small disparities [63]. Each neuron has its preferred SF and orientation. Based on these considerations, in this paper, we first use Gabor filters to compute the responses of the left and right images (stimuli from different spatial frequencies and orientations), denoted as C^{ref}_L(x, y; ω, θ) and C^{ref}_R(x, y; ω, θ), respectively, and then compute the binocular energy response with a phase shift by simulating a stimulus of appropriate disparity appearing in both eyes

$$E^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta) = \left\| \bar{C}^{\mathrm{ref}}_{L}(x, y; \omega, \theta) + e^{j\Delta\psi}\, \bar{C}^{\mathrm{ref}}_{R}(x, y; \omega, \theta) \right\|^2. \tag{6}$$

Distortion will affect the perceived depth when conflicts of binocular fusion arise between the two views. In order to solve this problem and to highlight disparity-related activities, we add a phase shift by simulating a stimulus of appropriate disparity appearing in both eyes. Here, different disparities are tuned by varying the phase shift between −π and +π. In the experiment, we use the normalized Gabor responses C̄^{ref}_L(x, y; ω, θ) and C̄^{ref}_R(x, y; ω, θ) to construct the binocular energy, in order to reduce the influence of "interocular" contrast differences. Eight orientations, 0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, and 7π/4, are used with five different spatial frequencies, 1.74, 2.47, 3.49, 4.93, and 6.98 cycles/degree (the scale is reflected by the spatial frequencies), under nine different phase shifts, −π, −7π/8, −3π/4, −5π/8, −π/2, −3π/8, −π/4, −π/8, and 0 (the disparity is reflected by the phase shifts). Similarly, the binocular energy response E^{dis}(x, y; Δψ, ω, θ) of the distorted stereoscopic image can be calculated.

As in [64], for each component of the binocular energy responses, we calculate the magnitude m^ref_k to encode the generalized spectral behavior, the variance v^ref_k to describe the fluctuations of the energy, and the entropy e^ref_k to represent the generalized information by

$$m^{\mathrm{ref}}_k = \frac{1}{N_R} \sum_{(x,y) \in R} \log_2 \left| E^{\mathrm{ref}}_k(x, y) \right| \tag{7}$$

$$v^{\mathrm{ref}}_k = \frac{1}{N_R} \sum_{(x,y) \in R} \log_2 \left| E^{\mathrm{ref}}_k(x, y) - m^{\mathrm{ref}}_k \right| \tag{8}$$

$$e^{\mathrm{ref}}_k = \frac{1}{N_R} \sum_{(x,y) \in R} p\left[ E^{\mathrm{ref}}_k(x, y) \right] \cdot \ln p\left[ E^{\mathrm{ref}}_k(x, y) \right] \tag{9}$$

where E^ref_k(x, y) is the response of the kth component, N_R is the number of pixels in an image, and p[·] is the probability density function of the response.

Finally, the feature vector at each phase shift is represented by combining the feature values from all orientations and spatial frequencies

$$f^{\mathrm{ref}}_{\Delta\psi} = \left[ m^{\mathrm{ref}}_1, m^{\mathrm{ref}}_2, \ldots, m^{\mathrm{ref}}_{40},\; v^{\mathrm{ref}}_1, v^{\mathrm{ref}}_2, \ldots, v^{\mathrm{ref}}_{40},\; e^{\mathrm{ref}}_1, e^{\mathrm{ref}}_2, \ldots, e^{\mathrm{ref}}_{40} \right]. \tag{10}$$

Similarly, we apply the FastICA algorithm to learn sparse basis vectors from the feature vectors. Since these sparse basis vectors are learnt from global statistical features (mainly reflecting image global properties), we define them as GRFs. For the reference and distorted stereoscopic images, we learn two sets of GRFs ({R̆^{ref}_{j,Δψ}} and {R̆^{dis}_{j,Δψ}}), and each R̆^{ref}_{j,Δψ} or R̆^{dis}_{j,Δψ} is represented in the form of an 8 × 120 matrix (each row of the matrix corresponds to one GRF unit).

Overall, the properties of the extracted features correspond quite well to those of RFs measured for visual perception. Another important aspect is then how to calculate the perceptual quality of the distorted stereoscopic images based on the features. To this end, we measure the similarity between the gradient vectors of the energy responses. More specifically, we define the phase difference between them as

$$\mathrm{PD}(x, y; \Delta\psi, \omega, \theta) = \arccos\left( \frac{\left\langle G^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta),\, G^{\mathrm{dis}}(x, y; \Delta\psi, \omega, \theta) \right\rangle}{\left\| G^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta) \right\|_2 \cdot \left\| G^{\mathrm{dis}}(x, y; \Delta\psi, \omega, \theta) \right\|_2} \right) \tag{11}$$


Fig. 5. Example of the GRFs from asymmetric training database.

where

$$G^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta) = \left[ \frac{\partial E^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta)}{\partial x},\; \frac{\partial E^{\mathrm{ref}}(x, y; \Delta\psi, \omega, \theta)}{\partial y} \right] \tag{12}$$

$$G^{\mathrm{dis}}(x, y; \Delta\psi, \omega, \theta) = \left[ \frac{\partial E^{\mathrm{dis}}(x, y; \Delta\psi, \omega, \theta)}{\partial x},\; \frac{\partial E^{\mathrm{dis}}(x, y; \Delta\psi, \omega, \theta)}{\partial y} \right]. \tag{13}$$

As in [65], the phase difference (11) is weighted by a function ρ(·) that penalizes the phase discrepancy between the gradient vectors

$$\rho(x, y; \Delta\psi, \omega, \theta) = \frac{\cos\left( 2 \cdot \mathrm{PD}(x, y; \Delta\psi, \omega, \theta) \right) + 1}{2}. \tag{14}$$

Since the stimuli from phase shifts provide the main binocular depth perception, in order to reflect different levels of stimulus disparity, the quality score for each pixel is calculated by summing the phase differences across all spatial frequencies and orientations

$$\breve{q}_{j,\Delta\psi}(x, y) = \frac{1}{H} \sum_{\omega} \sum_{\theta} \rho(x, y; \Delta\psi, \omega, \theta) \tag{15}$$

where H is the number of all spatial frequency and orientation combinations; thus, the quality score is normalized to [0, 1].

The final quality over all pixels is pooled by summing as

$$\breve{q}_{j,\Delta\psi} = \frac{\sum_{(x,y) \in R} \breve{q}_{j,\Delta\psi}(x, y)}{N_R} \tag{16}$$

where N_R is the number of pixels in an image. The adopted quality estimation method has proved effective for FR stereoscopic image quality evaluation in comparison with the existing IQA methods. The corresponding GQL for the jth training image is represented as q̆_j = {q̆_{j,Δψ}}. Fig. 5 shows an example of the constructed GRFs from the asymmetric training database under different distortion types. All the RFs for each distortion type form six sets of 8 × 120 matrices.
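A sketch of the GQL quality estimation in (11)-(16), assuming `comps_ref` and `comps_dis` hold the H = 40 paired energy maps of one phase shift:

```python
import numpy as np

def rho_map(E_ref: np.ndarray, E_dis: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-pixel weight per (11)-(14) for one (omega, theta) component."""
    gy_r, gx_r = np.gradient(E_ref)                 # G_ref = (dE/dx, dE/dy), (12)
    gy_d, gx_d = np.gradient(E_dis)                 # G_dis, (13)
    dot = gx_r * gx_d + gy_r * gy_d
    norm = np.hypot(gx_r, gy_r) * np.hypot(gx_d, gy_d) + eps
    pd = np.arccos(np.clip(dot / norm, -1.0, 1.0))  # phase difference, (11)
    return (np.cos(2.0 * pd) + 1.0) / 2.0           # rho, (14)

def gql_entry(comps_ref, comps_dis) -> float:
    """GQL value for one training image and phase shift, per (15)-(16)."""
    H = len(comps_ref)
    q_map = sum(rho_map(Er, Ed) for Er, Ed in zip(comps_ref, comps_dis)) / H  # (15)
    return float(q_map.mean())                                                # (16)
```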

C. Blind Quality Pooling Based on LRFs

Different from the existing BIQA metrics [16]–[19] that use complex regression testing for quality pooling, a simple but effective strategy is presented here to estimate the perceptual quality scores. For a test stereoscopic image pair, we first separate it into nonoverlapped patches of size 8 × 8 (the same size as one LRF unit), and calculate the sparse coefficient vectors with respect to the learnt R^{L,ref}_j and R^{L,dis}_j by

$$a^{L}_{t,j} = \left( R^{L,\mathrm{ref}}_{j} \right)^{T} \times y_t \tag{17}$$

$$b^{L}_{t,j} = \left( R^{L,\mathrm{dis}}_{j} \right)^{T} \times y_t \tag{18}$$

where y_t denotes the test image patch vector. Since the size of R^{L,ref}_j and R^{L,dis}_j is 64 × 8, the length of a^L_{t,j} and b^L_{t,j} is 8. Each element in a^L_{t,j} and b^L_{t,j} reflects the sparsity of the corresponding sparse basis vector. To facilitate the following analysis, a^L_{t,j} and b^L_{t,j} are expressed as a^L_{t,j} = {a^L_{t,i,j} | 1 ≤ i ≤ 8} and b^L_{t,j} = {b^L_{t,i,j} | 1 ≤ i ≤ 8}, where a^L_{t,i,j} and b^L_{t,i,j} are the sparse coefficients for the ith sparse basis vector.

According to [66], neurons with the same responses will always have consistent sparse coefficients (which resolves the classification issue). Inspired by this, we calculate the minimum distance for each sparse coefficient across all the training images

$$\delta^{L}_{t,i} = \min_{j} \left( a^{L}_{t,i,j} - b^{L}_{t,i,j} \right)^2. \tag{19}$$

The recorded training image with the minimum distance δ^L_{t,i} is expressed as j±. In (3) and (4), we calculate the differences between two basis vectors to infer their quality. According to the property that an image patch can be represented by a linear superposition of a set of basis vectors, the shorter the distance δ^L_{t,i} is, the more likely it is that patch y_t has the same quality level as that in the LRF. Based on this consideration, we use the quality score q^L_{i,j±} from the LQL to determine the quality score of the patch

$$z^{L}_{t} = \frac{\sum_{i=1}^{M} q^{L}_{i,j\pm} \cdot \exp\left( -\delta^{L}_{t,i} / \lambda \right)}{\sum_{i=1}^{M} \exp\left( -\delta^{L}_{t,i} / \lambda \right)} \tag{20}$$

where M is the number of retained sparse basis vectors (M = 8), and λ is a parameter to control the weight exp(−δ^L_{t,i}/λ) with respect to the distance δ^L_{t,i}. In the experiment, we set λ = 300.

We would normally take the average quality score of all patches as the final quality. However, for blind quality pooling, since the constructed LRFs in the training stage are learnt from numerous patches, the quality scores of different patches in the test image will be quite different (i.e., the reconstruction errors for these patches will be large). We therefore only select the patches with strong visual responses for quality pooling. To do this, we define a threshold as

$$Th = \frac{k_1}{N} \sum_{t=1}^{N} \left\| a^{L}_{t,j\pm} \right\|_2^2 \tag{21}$$

where N is the number of all patches in an image and k_1 is a parameter to adjust the threshold. We set k_1 = 0.4 in the experiment, similar to [46]. By selecting the patches satisfying ‖a^L_{t,j±}‖_2 > Th, we calculate the quality score for the left image by

$$Q_L = \frac{1}{N'} \sum_{t=1}^{N'} z^{L}_{t} \tag{22}$$

where N′ is the number of selected patches in the image. The quality score Q_R for the right image can be calculated in the same manner.
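A sketch of the LRF-based pooling in (17)-(22). `R_refs`, `R_diss`, and `lqls` hold, per training image, the 64 × 8 LRF matrices and the length-8 LQL. Because j± in (19) is recorded per coefficient, which image supplies the activation norm in (21) is ambiguous in the text; the sketch uses the image with the smallest total distance, and thresholds squared norms consistently — both are assumptions:

```python
import numpy as np

LAMBDA, K1 = 300.0, 0.4   # parameter values given in Section III-C

def pool_one_view(patches, R_refs, R_diss, lqls) -> float:
    """Quality score Q_L (or Q_R) for one view of the test pair."""
    scores, strength = [], []
    for y in patches:                                  # y: vectorized 8x8 patch
        a = np.stack([R.T @ y for R in R_refs])        # coefficients, (17)
        b = np.stack([R.T @ y for R in R_diss])        # coefficients, (18)
        d2 = (a - b) ** 2                              # per-coefficient distances
        delta = d2.min(axis=0)                         # (19)
        j_pm = d2.argmin(axis=0)                       # recorded indexes j+-
        q = np.array([lqls[j][i] for i, j in enumerate(j_pm)])
        w = np.exp(-delta / LAMBDA)
        scores.append(float((q * w).sum() / w.sum()))  # patch score z_t, (20)
        j_best = int(d2.sum(axis=1).argmin())          # assumed j+- for (21)
        strength.append(float(np.linalg.norm(a[j_best]) ** 2))
    scores, strength = np.array(scores), np.array(strength)
    th = K1 * strength.mean()                          # threshold Th, (21)
    keep = strength > th                               # strong-response patches
    return float(scores[keep].mean()) if keep.any() else float(scores.mean())  # (22)
```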


Since the RF of each eye changes when responding to left-right discrepancy/dissimilarity, especially for asymmetrically distorted stereoscopic images, inspired by the definition of ocular dominance in (2), we calculate the sparse complexity for each eye to characterize the binocular interaction (to account for binocular rivalry in the case of distortions), and establish a link between both 2D and 3D image qualities. The weights for the left and right images are defined, respectively, by

$$\rho_L = \frac{\sum_{t=1}^{N'} \left\| b^{L}_{t,j\pm} \right\|_2}{\sum_{t=1}^{N'} \left\| b^{L}_{t,j\pm} \right\|_2 + \sum_{t=1}^{N'} \left\| b^{R}_{t,j\pm} \right\|_2} \tag{23}$$

$$\rho_R = \frac{\sum_{t=1}^{N'} \left\| b^{R}_{t,j\pm} \right\|_2}{\sum_{t=1}^{N'} \left\| b^{L}_{t,j\pm} \right\|_2 + \sum_{t=1}^{N'} \left\| b^{R}_{t,j\pm} \right\|_2}. \tag{24}$$

It is important to note that, for these weights, sparse complexity is defined as the sparsity of the sparse coefficient vectors averaged over all patches. The definitions are derived from ocular dominance to represent the binocular interaction between the left and right eyes [67], but the basis vectors in (2) are replaced by the sparse coefficients. As analyzed for (21), the sparse coefficients reflect visual responses. Naturally, the eye with the stronger visual response will have a larger weight in the binocular combination. Our experimental results also show that the proposed weights contribute positively to the achieved improvement in 3D quality evaluation.

With the above defined weights ρ_L and ρ_R, we derive the final LRF index as

$$Q_{\mathrm{LRF}} = \rho_L \cdot Q_L + \rho_R \cdot Q_R. \tag{25}$$
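A sketch of the rivalry-aware combination in (23)-(25), assuming `b_left` and `b_right` collect the selected patches' coefficient vectors b_{t,j±} for each view:

```python
import numpy as np

def lrf_index(b_left, b_right, q_left: float, q_right: float) -> float:
    """Combine per-view scores into Q_LRF per (23)-(25)."""
    s_l = sum(np.linalg.norm(b) for b in b_left)    # sparse complexity, left eye
    s_r = sum(np.linalg.norm(b) for b in b_right)   # sparse complexity, right eye
    rho_l = s_l / (s_l + s_r)                       # (23)
    rho_r = s_r / (s_l + s_r)                       # (24)
    return rho_l * q_left + rho_r * q_right         # Q_LRF, (25)
```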

D. Blind Quality Pooling Based on GRFs

For a test stereoscopic image pair, we first extract the feature vector g_{Δψ} under a given phase shift by (6)–(10) across all spatial frequencies and orientations, and calculate the sparse coefficient vectors based on R̆^{ref}_{j,Δψ} and R̆^{dis}_{j,Δψ} by

$$c_{j,\Delta\psi} = \left( \breve{R}^{\mathrm{ref}}_{j,\Delta\psi} \right)^{T} \times g_{\Delta\psi} \tag{26}$$

$$d_{j,\Delta\psi} = \left( \breve{R}^{\mathrm{dis}}_{j,\Delta\psi} \right)^{T} \times g_{\Delta\psi}. \tag{27}$$

Then, the minimum distance for each sparse coefficient vector across all the training images is calculated as

$$\tau_{\Delta\psi} = \min_{j} \left\| c_{j,\Delta\psi} - d_{j,\Delta\psi} \right\|_2. \tag{28}$$

The recorded training image with the minimum distance is expressed as j∓. Since the GQLs are calculated for each phase shift, the final GRF index is determined by fusing the impacts of all phase shifts

$$Q_{\mathrm{GRF}} = \frac{\sum_{\Delta\psi} \breve{q}_{j\mp,\Delta\psi} \cdot \exp\left( -\tau_{\Delta\psi} / \lambda \right)}{\sum_{\Delta\psi} \exp\left( -\tau_{\Delta\psi} / \lambda \right)}. \tag{29}$$

E. Final Quality Pooling

The final index is calculated by combining Q_LRF and Q_GRF into a quality score by

$$Q = \gamma \cdot Q_{\mathrm{LRF}} + (1 - \gamma) \cdot Q_{\mathrm{GRF}} \tag{30}$$

where 0 < γ < 1 is a parameter for adjusting the relative importance of the two components; its proper value is 0.2 (to be explained in Section IV-E).
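A sketch of the GRF-based pooling and the final combination in (26)-(30), assuming each GRF matrix is stored with one unit per column (120 × 8) so that the projection mirrors (17), and that `gqls[j]` maps each phase shift to the GQL value q̆_{j,Δψ}:

```python
import numpy as np

LAMBDA, GAMMA = 300.0, 0.2   # lambda as in (20); gamma per Section III-E

def grf_index(g_by_dpsi, R_refs, R_diss, gqls) -> float:
    """GRF index per (26)-(29); g_by_dpsi maps each phase shift to the
    120-dim test feature vector, R_refs/R_diss/gqls are per training image."""
    num = den = 0.0
    for dpsi, g in g_by_dpsi.items():
        c = np.stack([R[dpsi].T @ g for R in R_refs])   # coefficients, (26)
        d = np.stack([R[dpsi].T @ g for R in R_diss])   # coefficients, (27)
        dist = np.linalg.norm(c - d, axis=1)
        j = int(dist.argmin())                          # recorded index, (28)
        w = np.exp(-dist[j] / LAMBDA)
        num += gqls[j][dpsi] * w                        # fuse phase shifts, (29)
        den += w
    return num / den

def final_score(q_lrf: float, q_grf: float) -> float:
    """Final quality per (30)."""
    return GAMMA * q_lrf + (1.0 - GAMMA) * q_grf
```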

IV. EXPERIMENTAL RESULTS AND ANALYSES

A. Databases and Performance Measures

1) NBU 3D IQA Database [31]: It consists of 312 distorted stereoscopic pairs generated from 12 reference stereoscopic images. Five types of distortions, JPEG, JPEG 2000 (JP2K), Gaussian blur (GB), white noise (WN), and H.264, are symmetrically applied to the left and right reference stereoscopic images at various levels.

2) LIVE 3D IQA Database Phase I [33]: It consists of 365 distorted stereoscopic pairs generated from 20 reference stereoscopic images. Five types of distortions, JPEG, JP2K, GB, WN, and fast fading (FF), are symmetrically applied to the left and right reference stereoscopic images at various levels.

3) LIVE 3D IQA Database Phase II [68]: It consists of 120 symmetrically distorted stereoscopic pairs and 240 asymmetrically distorted stereoscopic pairs generated from eight reference stereoscopic images. Five types of distortions, JPEG, JP2K, GB, WN, and FF, are symmetrically and asymmetrically applied to the left and right reference stereoscopic images at various levels.

In our experiment, referring to [17] and to facilitate the simulation, we only consider the four types of distortions that are common to the three databases: 1) JPEG; 2) JP2K; 3) GB; and 4) WN. Thus, we use 60, 60, 60, and 60 distorted stereoscopic images in the NBU 3D IQA database, 80, 80, 45, and 80 distorted stereoscopic images in the LIVE 3D IQA database phase I, and 72, 72, 90, and 72 distorted stereoscopic images in the LIVE 3D IQA database phase II for the JPEG, JP2K, GB, and WN distortions, respectively. Of course, considering only these distortion types is not comprehensive. Further work is required to address various distortion types more comprehensively, and constructing a multiply distorted stereoscopic image database is a feasible way to learn RFs [69].

In this paper, two commonly used performance indicators are used to benchmark the proposed metric against the relevant state-of-the-art techniques: 1) the Pearson linear correlation coefficient (PLCC) and 2) the Spearman rank order correlation coefficient (SRCC), between the objective and subjective scores. For a perfect match between the objective and subjective scores, PLCC = SRCC = 1. For the nonlinear regression, we use the following five-parameter logistic function [70]:

$$\mathrm{DMOS}_p = \beta_1 \cdot \left( \frac{1}{2} - \frac{1}{1 + \exp\left( \beta_2 \cdot (x - \beta_3) \right)} \right) + \beta_4 \cdot x + \beta_5 \tag{31}$$

where β_1, β_2, β_3, β_4, and β_5 are determined by using the subjective scores and the objective scores.


Fig. 6. Left images of the training images used in this paper. (a) Symmetric and (b) asymmetric training databases.

B. Selection of Training Database

To analyze whether the selection of training databases affects the handling of binocular rivalry (mainly arising from asymmetric degradations) in the testing stage, we define two training databases (i.e., symmetric and asymmetric training databases). For the symmetric training database, we select ten pairs of reference stereoscopic images from the NBU 3D IQA database and the LIVE 3D IQA database phase I, respectively, which have different scenes [the original left images of the symmetric database are shown in Fig. 6(a)]. For each image, referring to [17], we select its distorted versions of each distortion type (JPEG, JP2K, GB, and WN) at four quality levels. The distortions are symmetrically applied to the left and right reference stereoscopic images. Thus, we obtain a symmetric dataset of 160 pairs of distorted stereoscopic images and 10 pairs of reference stereoscopic images. For the asymmetric training database, we select eight pairs of reference stereoscopic images from the LIVE 3D IQA database phase II [the original left images of the asymmetric database are shown in Fig. 6(b)]. For each image, we select its distorted versions of each distortion type (JPEG, JP2K, GB, and WN) at six quality levels. The distortions are asymmetrically applied to the left and right reference stereoscopic images. Thus, we obtain an asymmetric dataset of 192 pairs of distorted stereoscopic images and eight pairs of reference stereoscopic images.

C. Comparison With FR-IQA Metrics

The PLCC and SRCC of the proposed scheme on the three databases are given in Tables I–III. Two FR 2D-IQA metrics (structural similarity (SSIM) and multiscale SSIM (MS-SSIM)) and two 3D-IQA metrics (Bensalma's scheme [28] and Chen's scheme [33]) are used for reference (due to the limitation of space in the tables, we use "A [28]" and "A [33]" to represent the two 3D-IQA metrics, respectively). We use "Pro-S" and "Pro-A" to represent the proposed scheme with the symmetric and asymmetric training databases in the training stage, respectively. Table I shows that the overall performance of the proposed scheme is better than SSIM and MS-SSIM, and that the proposed scheme (both Pro-S and Pro-A) performs better than Bensalma's scheme on the JPEG and WN distortions and better than Chen's scheme on the GB distortion.

TABLE I: PERFORMANCE COMPARISON OF THE SIX SCHEMES ON THE NBU 3D IQA DATABASE

TABLE II: PERFORMANCE COMPARISON OF THE SIX SCHEMES ON THE LIVE 3D IQA PHASE I DATABASE

than Chen’s scheme on GB distortion. Table II shows that theproposed scheme (Pro-S) outperforms Bensalma’s scheme onall distortions and the overall evaluation, and Chen’s schemeon JP2K and GB distortions. Table III shows that the pro-posed scheme (Pro-S) performs better than Bensalma’s schemeon JP2K and GB distortions and the overall evaluation, andChen’s scheme on GB distortion. Overall, the performance ofthe proposed scheme is always competitive compared with the2D-IQA metrics.

As shown in [16] and [17], BIQA methods without human opinion scores for learning are inferior to FR-IQA and BIQA methods that learn with human opinion scores, but the proposed scheme (without human opinion scores) is still competitive for some distortion types.


Fig. 7. Scatter plots of predicted quality scores against the subjective scores (DMOS) of the proposed scheme with the symmetric training database. (a) NBU 3D IQA. (b) LIVE 3D IQA phase I. (c) LIVE 3D IQA phase II.

Fig. 8. Scatter plots of predicted quality scores against the subjective scores (DMOS) of the proposed scheme with the asymmetric training database. (a) NBU 3D IQA. (b) LIVE 3D IQA phase I. (c) LIVE 3D IQA phase II.

TABLE III: PERFORMANCE COMPARISON OF THE SIX SCHEMES ON THE LIVE 3D IQA PHASE II DATABASE

In the stereoscopic case, the major difficulty is how to select the training images to account for binocular rivalry. In the proposed scheme, with symmetric or asymmetric stereoscopic images for training, the evaluation performances are quite different, because the relationships between the left-right RFs and the quality scores differ in the two databases. The training database composed of symmetrically distorted stereoscopic images is a good choice for the evaluation of asymmetric degradations, except for some distortion types with symmetric degradations. Therefore, the performance of the proposed scheme can be further improved by properly selecting the training images. This is also a challenge for quality assessment of stereoscopic images. Figs. 7 and 8 show the scatter plots of predicted quality scores against subjective quality scores (in terms of difference mean opinion scores) of the proposed scheme with the symmetric and asymmetric training databases, respectively.

D. Comparison With Other BIQA Metrics

In order to compare with state-of-the-art BIQA metrics, DIIVINE [9], BRISQUE [10], and BLIINDS-II [11] are used for reference. In order to apply these metrics to the 3D case, feature vectors are extracted separately for the left and right images, and weight-averaged to obtain the final feature vector for training. We present their results under three settings: 80%, 50%, and 30% of the samples are used for training and the remainder for testing. The process is repeated 1000 times with random training and testing partitions, and the average results are reported. The PLCC and SRCC are given in Tables IV and V for the three databases. We can find that, when 80% of the samples are used for training in the BIQA metrics, the average performance of the proposed scheme is slightly lower than DIIVINE and BRISQUE.


TABLE IV: PLCC COMPARISONS WITH OTHER BIQA METHODS

TABLE V: SRCC COMPARISONS WITH OTHER BIQA METHODS

However, when 50% of the samples are used for training, the proposed scheme outperforms the BIQA metrics on most of the databases. Obviously, the performance of these BIQA metrics decreases rapidly with the decrease of training samples, while the training samples are fixed in the proposed scheme (no cross-validation operation is needed).

E. Impact of Each Component in the Proposed Scheme

To demonstrate the impact of the LRF and GRF in the proposed scheme, we design two schemes for comparison that only use the LRF or GRF index for evaluation. The results for PLCC and SRCC are presented in Tables VI and VII. In the tables, only the results with the symmetric training database are listed; similar results are found with the asymmetric training database. From the tables, we can see that adopting only the LRF or GRF cannot obtain the best performance for most distortion types, while the evaluation performance can be promoted by properly combining the LRF and GRF indexes. In this paper, the parameter γ is trained on the NBU 3D IQA database via optimizing the SRCC, and the same parameter value is used for all other databases.

Even though the parameter is decided as γ = 0.20 (i.e., the GRF component is more important than the LRF component in this regard), independently applying the GRF cannot obtain the best evaluation performance for some distortions. Local structure degradations induced by the JPEG, JP2K, and WN distortions cannot be well addressed by the LRF, while the GRF can compensate for the problem. That is, the performance can be improved if we independently train the parameter γ for each distortion type and each database.

F. Discussion

1) Degenerated Cases: Our method is based on fusing multiple quality scores based on the constructed quality lookups. Therefore, the major scenario that causes our method to fail is when the quality lookups lose their power (i.e., they cannot correctly map from basis vectors to quality scores). The adopted quality estimation model is not optimal for all distortion types in IQA [71], [72]. Therefore, it would be beneficial to generate combined synthetic scores for multiple FR-IQA measures, as done in [73]. In another aspect, the sparse basis vectors learnt by ICA are incomplete (the number of bases is lower than the dimensionality of the input), so the calculated distances between the estimated sparse coefficients do not always reflect the real difference between the images (patches), while overcomplete representation is an important property of V1 [74]. Therefore, overcomplete sparse basis vectors can be considered by using other sparse coding methods.


TABLE VI: PLCC COMPARISON FOR EACH COMPONENT OF THE PROPOSED SCHEME

TABLE VII: SRCC COMPARISON FOR EACH COMPONENT OF THE PROPOSED SCHEME


2) Influence of Training Images: The RFs (LRF and GRF) and quality lookups (LQL and GQL) depend on the training database. In the experiment, we find that the proposed scheme with a training database composed of symmetrically distorted stereoscopic images is effective in addressing the issue of binocular rivalry, especially for asymmetric degradations, but this may not hold for all distortions. Since the LRFs and LQLs are independently constructed for each left and right image in the training stage, the connectivity between the left and right views is mainly addressed by sparse complexity in the test stage. Furthermore, how to address this connectivity for different types of distortion, and how to construct a hybrid database (e.g., with multiple distortion types, or a symmetric and asymmetric hybrid), should be considered.

V. CONCLUSION

A blind quality assessment method for stereoscopic images has been presented by learning RFs and quality lookups, from the perspective of dictionary learning and replacing human opinion scores. The main contribution of this paper is that we develop RFs and quality lookups to guide blind quality pooling, so that we do not need to learn a regression model from a large set of samples of distorted stereoscopic images and corresponding human opinion scores. More specifically, the contributions of the proposed scheme are: 1) we learn LRFs from the reference and the distorted stereoscopic image patches, and construct their corresponding LQLs to infer the local properties; 2) we learn GRFs from the reference and the distorted binocular energy responses, and construct their corresponding GQLs to infer the global properties; and 3) blind quality pooling is achieved by searching the optimal LRF and GRF indexes from the learnt LQLs and GQLs, respectively, and the final quality score is obtained by combining the LRF and GRF indexes.

Although the proposed scheme exhibits good performance in evaluating the quality of stereoscopic images, some aspects still deserve further research and improvement: 1) since the proposed scheme does not fully account for asymmetric distortion in the training stage, the connectivity between the left and right views should be further considered and 2) it is valuable to design a more effective FR quality estimation model, so that the relevance between the left-right sparse properties and the quality scores in the training stage can be better characterized.


REFERENCES

[1] D. C. Mocanu, G. Exarchakos, and A. Liotta, "Deep learning for objective quality assessment of 3D images," in Proc. Int. Conf. Image Process., Paris, France, Oct. 2014, pp. 758–762.

[2] S. Winkler, "Efficient measurement of stereoscopic 3D video content issues," Proc. SPIE Image Qual. Syst. Perform. XI, vol. 9016, Art. ID 90160Q, Jan. 2014.

[3] W. Lin and C. C. J. Kuo, "Perceptual visual quality metrics: A survey," J. Vis. Commun. Image Represent., vol. 22, no. 4, pp. 297–312, May 2011.

[4] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.

[5] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multi-scale structural similarity for image quality assessment," in Proc. IEEE Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, USA, 2003, pp. 1398–1402.

[6] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Process. Lett., vol. 9, no. 3, pp. 81–84, Mar. 2002.

[7] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006.

[8] A. K. Moorthy and A. C. Bovik, "A two-step framework for constructing blind image quality indices," IEEE Signal Process. Lett., vol. 17, no. 5, pp. 513–516, May 2010.

[9] A. K. Moorthy and A. C. Bovik, "Blind image quality assessment: From natural scene statistics to perceptual quality," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3350–3364, Dec. 2011.

[10] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.

[11] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain," IEEE Trans. Image Process., vol. 21, no. 8, pp. 3339–3352, Aug. 2012.

[12] X. Gao, F. Gao, D. Tao, and X. Li, "Universal blind image quality assessment metrics via natural scene statistics and multiple kernel learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2013–2026, Dec. 2013.

[13] W. Hou, X. Gao, D. Tao, and X. Li, "Blind image quality assessment via deep learning," IEEE Trans. Neural Netw. Learn. Syst., 2014, DOI: 10.1109/TNNLS.2014.2336852.

[14] L. Kang, P. Ye, Y. Li, and D. Doermann, "Convolutional neural networks for no-reference image quality assessment," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, 2014, pp. 1733–1740.

[15] K. Gu, G. Zhai, X. Yang, and W. Zhang, "Deep learning network for blind quality assessment," in Proc. Int. Conf. Image Process., Paris, France, Oct. 2014, pp. 511–515.

[16] A. Mittal, G. S. Muralidhar, J. Ghosh, and A. C. Bovik, "Blind image quality assessment without human training using latent quality factors," IEEE Signal Process. Lett., vol. 19, no. 2, pp. 75–78, Feb. 2012.

[17] W. Xue, L. Zhang, and X. Mou, "Learning without human scores for blind image quality assessment," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, Jun. 2013, pp. 995–1002.

[18] P. Lebreton, A. Raake, M. Barkowsky, and P. Le Callet, "Evaluating depth perception of 3D stereoscopic videos," IEEE J. Sel. Topics Signal Process., vol. 6, no. 6, pp. 710–720, Oct. 2012.

[19] W. J. Tam, F. Speranza, S. Yano, K. Shimono, and H. Ono, "Stereoscopic 3D-TV: Visual comfort," IEEE Trans. Broadcast., vol. 57, no. 2, pp. 335–346, Jun. 2011.

[20] L. E. Coria, D. Xu, and P. Nasiopoulos, "Quality of experience of stereoscopic content on displays of different sizes: A comprehensive subjective evaluation," in Proc. IEEE Int. Conf. Consum. Electron., Las Vegas, NV, USA, Jan. 2011, pp. 755–758.

[21] P. Gorley and N. Holliman, "Stereoscopic image quality metrics and compression," Proc. SPIE Stereoscop. Displays Appl. XIX, vol. 6803, Feb. 2008, Art. ID 680305.

[22] A. Boev, A. Gotchev, K. Egiazarian, A. Aksay, and G. B. Akar, "Towards compound stereo-video quality metric: A specific encoder-based framework," in Proc. IEEE Southwest Symp. Image Anal. Interpret., Denver, CO, USA, 2006, pp. 218–222.

[23] A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau, "Quality assessment of stereoscopic images," EURASIP J. Image Video Process., vol. 2008, Jan. 2009, Art. ID 659024.

[24] J. You, L. Xing, A. Perkis, and X. Wang, "Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis," in Proc. Int. Workshop Video Process. Qual. Metrics Consum. Electron., Scottsdale, AZ, USA, 2010, pp. 61–66.

[25] S. L. P. Yasakethu, C. T. E. R. Hewage, W. A. C. Fernando, and A. M. Kondoz, "Quality analysis for 3D video using 2D video quality models," IEEE Trans. Consum. Electron., vol. 54, no. 4, pp. 1969–1976, Nov. 2008.

[26] J. J. Hwang and H. R. Wu, "Stereo image quality assessment using visual attention and distortion predictors," KSII Trans. Internet Inf. Syst., vol. 5, no. 9, pp. 1613–1631, Sep. 2011.

[27] X. Wang, S. Kwong, and Y. Zhang, "Considering binocular spatial sensitivity in stereoscopic image quality assessment," in Proc. IEEE Vis. Commun. Image Process., Tainan, Taiwan, Nov. 2011, pp. 1–4.

[28] R. Bensalma and M. C. Larabi, "A perceptual metric for stereoscopic image quality assessment based on the binocular energy," Multidim. Syst. Signal Process., vol. 24, no. 2, pp. 281–316, Jun. 2013.

[29] A. Maalouf and M. C. Larabi, "CYCLOP: A stereo color image quality assessment metric," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Prague, Czech Republic, May 2011, pp. 1161–1164.

[30] L. Jin, A. Boev, A. Gotchev, and K. Egiazarian, "3D-DCT based perceptual quality assessment of stereo video," in Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, Sep. 2011, pp. 2521–2524.

[31] F. Shao, W. Lin, S. Gu, G. Jiang, and T. Srikanthan, "Perceptual full-reference quality assessment of stereoscopic images by considering binocular visual characteristics," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1940–1953, May 2013.

[32] V. De Silva, H. K. Arachchi, E. Ekmekcioglu, and A. Kondoz, “Towardsan impairment metric for stereoscopic video: A full-reference videoquality metric to assess compressed stereoscopic video,” IEEE Trans.Image Process., vol. 22, no. 9, pp. 3392–3404, Sep. 2013.

[33] M.-J. Chen, C.-C. Su, D.-K. Kwon, L. K. Cormack, and A. C. Bovik,“Full-reference quality assessment of stereopairs accounting for rivalry,”Signal Process. Image Commun., vol. 28, no. 9, pp. 1143–1155,Oct. 2013.

[34] Y. Lin and J. Wu, “Quality assessment of stereoscopic 3D imagecompression by binocular integration behaviors,” IEEE Trans. ImageProcess., vol. 23, no. 4, pp. 1527–1542, Apr. 2014.

[35] L. Jin, A. Boev, K. Egiazarian, and A. Gotchev, “Quantifying theimportance of cyclopean view and binocular rivalry-related features forobjective quality assessment of mobile 3D video,” EURASIP J. ImageVideo Process., vol. 2014, no. 6, 2014, pp. 1–18.

[36] S. Ryu and K. Sohn, “No-reference quality assessment for stereoscopicimages based on binocular quality perception,” IEEE Trans. CircuitsSyst. Video Technol., vol. 24, no. 4, pp. 591–602, 2013.

[37] M.-J. Chen, L. K. Cormack, and A. C. Bovik, “No-reference qualityassessment of natural stereopairs,” IEEE Trans. Image Process., vol. 22,no. 9, pp. 3379–3391, Sep. 2013.

[38] K. Gu, G. Zhai, X. Yang, and W. Zhang, “No-reference stereoscopicIQA approach: From nonlinear effect to parallax compensation,”J. Electr. Comput. Eng., vol. 2012, Sep. 2012, Art. ID 436031.

[39] K. Gu, G. Zhai, X. Yang, and W. Zhang, “A new no-referencestereoscopic image quality assessment based on ocular dominance theoryand degree of parallax,” in Proc. Int. Conf. Pattern Recognit., Tsukuba,Japan, Nov. 2012, pp. 206–209.

[40] Z. M. P. Sazzad, R. Akhter, J. Baltes, and Y. Horita, “Objectiveno-reference stereoscopic image quality prediction based on 2D imagefeatures and relative disparity,” Adv. Multimedia, vol. 2012, Jan. 2012,Art. ID 256130.

[41] D. M. Chandler, “Seven challenges in image quality assessment:Past, present, and future research,” ISRN Signal Process., vol. 2013,Art. ID 905685, 2013.

[42] M. Lambooij, W. IJsselsteijn, D. G. Bouwhuis, and I. Heynderickx,“Evaluation of stereoscopic images: Beyond 2D quality,” IEEE Trans.Broadcast., vol. 57, no. 2, pp. 432–444, Jun. 2011.

[43] R. Blake and H. Wilson, “Binocular vision,” Vis. Res., vol. 51, no. 7,pp. 754–770, Apr. 2011.

[44] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptivefield properties by learning a sparse code for natural images,” Nature,vol. 381, no. 6583, pp. 607–609, 1996.

[45] F. Zhang, W. Jiang, F. Autrusseau, and W. Lin, “Exploring V1 by mod-eling the perceptual quality of images,” J. Vis., vol. 14, no. 1, p. 26,Jan. 2014.

[46] H. W. Chang, H. Yang, Y. Gan, and M. H. Wang, “Sparse feature fidelityfor perceptual image quality assessment,” IEEE Trans. Image Process.,vol. 22, no. 10, pp. 4007–4018, Oct. 2013.

[47] T. Guha and R. Ward, “Learning sparse models for image qualityassessment,” in Proc. IEEE Int. Conf. Acoust. Speech SignalProcess. (ICASSP), Florence, Italy, May 2014, pp. 151–155.

Page 13: IEEE TRANSACTIONS ON CYBERNETICS 1 Learning Receptive

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHAO et al.: LEARNING RFs AND QUALITY LOOKUPS FOR BLIND QUALITY ASSESSMENT OF STEREOSCOPIC IMAGES 13

[48] J. J. Hunt, P. Dayan, and G. J. Goodhill, “Sparse coding can predictprimary visual cortex receptive field changes induced by abnormal visualinput,” PLoS Comput. Biol., vol. 9, no. 5, May 2013, Art. ID e1003005.

[49] D. Stidwill and R. Fletcher, “Normal binocular vision: Theory, inves-tigation and practical aspects,” 1st ed. New York, NY, USA: Wiley,2010.

[50] Y. D. Zhu and N. Qian, “Binocular receptive field models, disparitytuning, and characteristic disparity,” Neural Comput., vol. 8, no. 8,pp. 1611–1641, 1996.

[51] D. J. Fleet, H. Wagner, and D. J. Heeger, “Neural encoding of binoculardisparity: Energy models, position shifts and phase shifts,” Vis. Res.,vol. 36, no. 12, pp. 1839–1857, Jun. 1996.

[52] Q. Peng and B. E. Shi, “The changing disparity energy model,” Vis. Res.,vol. 50, no. 2, pp. 181–192, 2010.

[53] S. B. Steinman, B. A. Steinman, and R. P. Garzia, Foundationsof Binocular Vision: A Clinical Perspective. New York, NY, USA:McGraw-Hill, 2000.

[54] B. A. Olshausen and D. J. Field, “Sparse coding with an overcom-plete basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23,pp. 3311–3325, 1997.

[55] A. Hyvärinena and P. O. Hoyera, “A two-layer sparse coding modellearns simple and complex cell receptive fields and topography fromnatural images,” Vis. Res., vol. 41, no. 18, pp. 2413–2423, 2001.

[56] H. Shouval, N. Intrator, C. C. Law, and L. N. Cooper, “Effect ofbinocular cortical misalignment on ocular dominance and orientationselectivity,” Neural Comput., vol. 8, no. 5, pp. 1021–1040, 1996.

[57] A. Anzai, I. Ohzawa, and R. D. Freeman, “Neural mechanisms for pro-cessing binocular information I. Simple cells,” J. Neurophysiol., vol. 82,no. 2, pp. 891–908, 1999.

[58] A. Hyvärinen, “Fast and robust fixed-point algorithms for indepen-dent component analysis,” IEEE Trans. Neural Netw., vol. 10, no. 3,pp. 626–634, May 1999.

[59] A. J. Bell and T. J. Sejnowski, “An information-maximization approachto blind separation and blind deconvolution,” Neural Comput., vol. 7,no. 6, pp. 1129–1159, 1995.

[60] C. Zetzsche, G. Krieger, and B. Wegmann, “The atoms of vision:Cartesian or polar?” J. Opt. Soc. America A, vol. 16, no. 7,pp. 1554–1565, 1999.

[61] D. H. Hubel, “The visual cortex of the brain,” Sci. Amer., vol. 209, no. 5,pp. 54–63, Nov. 1963.

[62] G. C. DeAngelis, I. Ohzawa, and R. D. Freeman, “Neuronal mechanismsunderlying stereopsis: How do simple cells in the visual cortex encodebinocular disparity?” Perception, vol. 24, no. 1, pp. 3–31, 1995.

[63] H. S. Smallman and D. I. A. MacLeod, “Size-disparity correlation instereopsis at contrast threshold,” J. Opt. Soc. America A, vol. 11, no. 8,pp. 2169–2183, 1994.

[64] L. He, D. Tao, X. Li, and X. Gao, “Sparse representation for blind imagequality assessment,” in Proc. Int. Conf. Comput. Vis. Pattern Recognit.,Providence, RI, USA, 2012, pp. 1146–1153.

[65] J. P. Pluim, J. B. Maintz, and M. A. Viergever, “Image registra-tion by maximization of combined mutual information and gradientinformation,” IEEE Trans. Med. Imag., vol. 19, no. 8, pp. 809–814,Aug. 2000.

[66] M. W. Spratling, “Classification using sparse representations: A biolog-ically plausible approach,” Biol. Cybern., vol. 108, no. 1, pp. 61–73,2014.

[67] A. Macy, I. Ohzawa, and R. D. Freeman, “A quantitative study ofthe classification and stability of ocular dominance in the cat’s visualcortex,” Exp. Brain Res., vol. 48, no. 3, pp. 401–408, 1982.

[68] A. K. Moorthy, C. C. Su, A. Mittal, and A. C. Bovik, “Subjective eval-uation of stereoscopic image quality,” Signal Process. Image Commun.,vol. 28, no. 8, pp. 870–883, Dec. 2013.

[69] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Hybrid no-referencequality metric for singly and multiply distorted images,” IEEE Trans.Broadcast., vol. 60, no. 3, pp. 555–567, Sep. 2014.

[70] P. G. Gottschalk and J. R. Dunn, “The five-parameter logistic:A characterization and comparison with the four-parameter logistic,”Anal. Biochem., vol. 343, no. 1, pp. 54–65, Aug. 2005.

[71] L. Li et al., “No-reference image blur assessment based ondiscrete orthogonal moments,” IEEE Trans. Cybern., 2015.Doi: 10.1109/TCYB.2015.2392129.

[72] K. Gu, G. Zhai, W. Lin, and M. Liu, “The analysis of image contrast:From quality assessment to automatic enhancement,” 2015. IEEE Trans.Cybern., Doi: 10.1109/TCYB.2015.2401732.

[73] P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores blindimage quality assessment based on synthetic scores,” in Proc. IEEEConf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, 2014,pp. 4241–4248.

[74] Y. Liu, Z. Yang, and L. Yang, “Online signature verification basedon DCT and sparse representation,” IEEE Trans. Cybern., 2014.Doi: 10.1109/TCYB.2014.2375959.

Feng Shao received the B.S. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2002 and 2007, respectively, both in electronic science and technology.

He was a Visiting Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore, from February 2012 to August 2012. He is currently a Full Professor with the Faculty of Information Science and Engineering, Ningbo University, Ningbo, China. His current research interests include 3D video coding, 3D quality assessment, and image perception.

Weisi Lin (M’92–SM’98) received the B.Sc. and M.Sc. degrees from Zhongshan University, Guangzhou, China, and the Ph.D. degree from King’s College London, London, U.K.

He was the Laboratory Head of Visual Processing and the Acting Department Manager of Media Processing with the Institute for Infocomm Research, Singapore. He has been elected as a Distinguished Lecturer of APSIPA (2012/2013). He is currently an Associate Professor with the School of Computer Engineering, Nanyang Technological University, Singapore. His current research interests include image processing, perceptual modeling, video compression, multimedia communication, and computer vision. He has published over 200 refereed papers in international journals and conferences.

Dr. Lin has served as the Lead Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING Special Issue on Perceptual Signal Processing in 2012. He is the Chair of the IEEE MMTC Special Interest Group on Quality of Experience. He was the Lead Technical Program Chair for the Pacific-Rim Conference on Multimedia (PCM) in 2012 and a Technical Program Chair for the IEEE International Conference on Multimedia and Expo (ICME) in 2013. He is on the Editorial Boards of the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE SIGNAL PROCESSING LETTERS, and the Journal of Visual Communication and Image Representation. He is a Chartered Engineer, U.K., a Fellow of the Institution of Engineering and Technology, and an Honorary Fellow of the Singapore Institute of Engineering Technologists.

Shanshan Wang received the B.S. degree from the Shijiazhuang Institute of Economics, Shijiazhuang, China, in 2012. She is currently pursuing the M.S. degree at Ningbo University, Ningbo, China. Her current research interests include image/video processing and quality assessment.

Gangyi Jiang (A’04) received the M.S. degree from Hangzhou University, Zhejiang, China, and the Ph.D. degree from Ajou University, Suwon, Korea, in 1992 and 2000, respectively. He is currently a Professor with the Faculty of Information Science and Engineering, Ningbo University, Ningbo, China. His current research interests include digital video compression and multiview video coding.


Mei Yu received the M.S. degree from the Hangzhou Institute of Electronics Engineering, Hangzhou, China, and the Ph.D. degree from Ajou University, Suwon, Korea, in 1993 and 2000, respectively. She is currently a Professor with the Faculty of Information Science and Engineering, Ningbo University, Ningbo, China. Her current research interests include image/video coding and video perception.

Qionghai Dai (SM’05) received the B.S. degree from Shaanxi Normal University, Xi’an, China, in 1987, and the M.E. and Ph.D. degrees from Northeastern University, Shenyang, China, in 1994 and 1996, respectively. Since 1997, he has been with the faculty of Tsinghua University, Beijing, China, where he is currently a Professor and the Director of the Broadband Networks and Digital Media Laboratory. His current research interests include video communication, computer vision, and computational photography.