[ieee 2013 ieee international symposium on broadband multimedia systems and broadcasting (bmsb) -...

Objective Estimation of 3D Video Quality: ADisparity-based Weighting Strategy

Carlos Danilo Miranda Regis∗†‡, IEEE Student Member, Jose Vinıcius de Miranda Cardoso†‡,Italo de Pontes Oliveira∗, IEEE Student Member, and Marcelo Sampaio de Alencar†‡, IEEE Senior Member

∗Federal Institute of Education, Science and Technology of Paraıba – IFPB†Federal University of Campina Grande – UFCG

‡Institute for Advanced Studies in Communications – IecomE-mails: {danilo, josevinicius, malencar}@iecom.org.br and [email protected]

Telephone: +55 (83) 2101–1578Campina Grande, Brazil 58429-900

Abstract—This paper presents new results on the objectiveevaluation of stereoscopic video quality. A balance between thescores provided by classical objective algorithms and the disparityinformation of the reference video is found. The performanceof the proposed technique was verified using statistical met-rics (correlation coefficients and root mean square error) thatcompare the scores provided by the objective algorithms andthe subjective results provided by the NAMA3DS1–COSPAD1stereoscopic video quality database. The obtained results suggesta significant improvement on the performance of the objectivealgorithms for the proposed technique.

Index Terms—Performance Evaluation, Objective EvaluationTechniques, 3-D and Multi-view Videos

I. INTRODUCTION

Objective evaluation is a fast alternative to time-consumingsubjective evaluations. Objective algorithms were developedto evaluate the 2D video quality, but the 3D video has anew component that needs to be considered in the design ofalgorithms: the depth. Recently, the disparity has been used asa depth estimation [1].

Objective algorithms are computational models, which usestatistical characteristics of the video combined with featuresof the Human Visual System (HVS), to estimate the qualityscore, classified according to the availability of the origi-nal signals as: full reference, in which the original videois compared with the video under test (degraded); reducedreference, whenever only characteristics of the original videoare available for comparison with the video under test, andno reference, in which only the video under test is used forquality assessment.

This paper presents an investigation on the full referenceobjective algorithms for 3D video quality measurement whenthey are combined with the disparity present in the referencevideo. The focus of this research is to verify the impact of thedisparity on the visual attention and its consequence to theperceived 3D video quality.

Visual attention is a cognitive ability that involves search,selection and focus of relevant stimuli [2]. Experiments indi-cate that the human visual attention is not equally distributedthroughout the image environment, but concentrates in a fewregions [3]. Several approaches [4], [5], [6], [7], [8] indicate

that the inclusion of methods to identify the visual attentionof a scene, i.e., assign a weight to the visual importance ofregions on the image, enhances the evaluation provided by theobjective metrics.

Thus, the proposed approach combines a weighting, thatis a function of the disparity of the 3D original video, withthe score provided by objective algorithms, to improve theassessment of the 3D video quality. The statistical metricsused to compare the performance of the proposed approachwith the classical algorithms were: Pearson Linear Correla-tion Coefficient (PLCC), Spearman Rank-Order CorrelationCoefficient (SROCC), Kendall Rank-Order Correlation Coef-ficient (KROCC) and Root Mean Square Error (RMSE). TheConfidence Intervals for the PLCC are also presented.

The remaining of this paper is organized as follows. SectionII describes some objective algorithms for 2D video qualityused to predict the 3D video quality. Section III presents thecontribution of this paper: a combination between the disparityinformation contained in the reference video and the objectivealgorithms measurements. The subjective evaluation with 3Dvideo sequences, performed by NAMA3D1-COSPAD1 [9] isreviewed in Section IV. The discussion on the performance ofthe proposed algorithm is presented in Section V. Section VIpresents the conclusions.

II. OBJECTIVE APPROACHES

Let V = [vL(x, y, n), vR(x, y, n)] be a 3D video sample, inwhich the scalar functions vL and vR correspond to the leftand right views, respectively; (x, y) represents the rectangularspatial coordinates and n represents the frame number. Letyet F and H be 3D reference video and the video undertest, respectively. A full reference objective algorithm for 3Dvideo quality assessment is a function G, such that its image(G(F,H)) represents the quality of H with respect to F .

The following convention is used in this paper

G(F,H) =G(fL, hL) + G(fR, hR)

2, (1)

since the importance of the left and right views is the same.

A. Peak Signal-to-Noise Ratio

Let f(x, y, n) and h(x, y, n) be scalar functions that rep-resent 2D video sequences. The Mean Square Error (MSE)between the signals is computed as

MSE(f, h) =1

N

1

X

1

Y

N�

n=1

X�

x=1

Y�

y=1

[f(x, y, n)− h(x, y, n)]2.

(2)The Peak Signal-to-Noise Ratio (PSNR) is computed as

PSNR(f, h) = 20 · log10

�MAX�

MSE(f, h)

�dB, (3)

in which MAX is the maximum value of the gray scale andMSE(f, h) is the Mean Square Error between f and h.

B. Structural Similarity Index

The Structural SIMilarity (SSIM) [10] is a full-referenceapproach to image and video quality assessment based onthe assumption that the HVS is highly adapted to recognizestructural information in the visual environment and, therefore,the changes in the structural information provide a goodapproximation to the quality perceived by the human visualsystem.

The SSIM(f, h) is computed as a product of three measuresover the luminance plane: luminance comparison l(f, h), thecontrast comparison c(f, h) and the structural comparisons(f, h):

l(f, h) =2µfµh + C1

µ2f + µ2

h + C1, (4)

c(f, h) =2σfσh + C2

σ2f + σ2

h + C2, (5)

s(f, h) =σfh + C3

σfσh + C3, (6)

in which µ is the sample average, σ is the sample standarddeviation, σfh is the covariance, C1 = (0.01 · 255)2, C2 =(0.03 · 255)2 and C3 = C2

2 .The structural similarity index is described as

SSIM(f, h) = [l(f, h)]α · [c(f, h)]β · [s(f, h)]γ , (7)

in which usually α = β = γ = 1.In practice the SSIM is computed for an 8×8 sliding squared

window or for an 11×11 Gaussian-circular window. The firstapproach is used in this paper. Then, for two videos which aresubdivided into J blocks, the SSIM is computed as

SSIM(f, h) =1

J

J�

j=1

SSIM(fj , hj). (8)

C. Perceptual Weighted Structural Similarity Index

Regis et al. [11] proposed a technique called PerceptualWeighting (PW), which combines the local Spatial PerceptualInformation (SI), as a visual attention estimator, with theSSIM, since experiments indicate that the quality perceivedby the HVS is more sensitive in areas of intense visualattention [7]. The SI is computed using the Sobel differentialoperator, which estimates the magnitude of the gradient vectorsof the video.

The PW technique uses the local spatial perceptual infor-mation to weigh the most visually important regions. Thisweighting is obtained as follows: compute the magnitude ofthe gradient vectors in the original video by means of theSobel masks, then generate a perceptual map in which thepixel values are the magnitude of the gradient vectors. Theframe is partitioned into blocks 8 × 8 pixels, and the SI ineach block is computed as

SI(fj) =

�� 1

K − 1

K�

k=1

(µj − |∇f(k)|)2, (9)

in which, µj represents the sample average of the perceptualmap in a j-block and K is a total of gradient vectors in j-thblock. For the case that the frames are partitioned uniformlyin squares 8×8, K = 64. The Perceptual Weighted StructuralSimilarity Index (PW–SSIM) is computed as

PW–SSIM(f, h) =

�Jj=1 SSIM(fj , hj) · SI(fj)

�Jj=1 SI(fj)

. (10)

III. DISPARITY WEIGHTING STRATEGY

The disparity that is present in a 3D video sequence is aninformation related to the sense of the stereo perception [1].This information is computed as the difference between twocorresponding pixels in the left and right views. Indeed, asis well know, the disparity should be considered in the de-velopment of objective algorithms, to improve the correlationbetween the objective prediction and the subjective scores.

The disparity map, D(F ), is computed as

D(F (x, y, n)) = |fL(x, y, n)− fR(x, y, n)|, ∀ (x, y, n).(11)

The introduction of the disparity information in the 2D ob-jective metrics uses the weighted average of the objectivemeasurements with the disparity map. This approach wasimplemented in two objective metrics, PSNR and SSIM,producing the DPSNR and DSSIM.

The DPSNRL, i.e., DPSNR for the left view, is computedas

DPSNRL(F,H) = 20 · log10

�MAX�

DMSEL(F,H)

�dB. (13)

The DMSE and the DPSNR for the right view (DMSER andDPSNRR) are computed in the same manner. Then the overallDPSNR is the average of the DPSNRL and DPSNRR.

DMSEL(F,H) =

N�

n=1

X�

x=1

Y�

y=1

[f(x, y, n)− h(x, y, n)]2 · D(F (x, y, n))

N�

n=1

X�

x=1

Y�

y=1

D(F (x, y, n))

(12)

The DSSIM is computed as

DSSIM(F,H) =

�Jj=1 SSIM(Fj , Hj) · D(Fj)

�Jj=1 D(Fj)

, (14)

in which D(Fj) is the average disparity contained in block j.The DPW–SSIM is computed as

DPW–SSIM(F,H) =

�Jj=1 SSIM(Fj , Hj) · SI(Fj) · D(Fj)

�Jj=1 [SI(Fj) · D(Fj)]

.

(15)

IV. SUBJECTIVE EXPERIMENTS: NAMA3DS1-COSPAD1The NAMA3DS1-COSPAD1 [12] [9] stereoscopic video

quality database provides subjective results, using the Ab-solute Category Rating with Hidden Reference (ACR-HR)method [13], considering scenarios restricted to coding andspatial degradations, including: H.264 and JPEG2000 coding.Table I summarizes the types and levels of degradation pa-rameters used for subjective experiments performed in theNAMA3DS1-COSPAD1.

The diversity of the spatial and temporal features of thevideo samples available in the NAMA3DS1-COSPAD1 werequantified using the Spatial Perceptual Information and Tem-poral Perceptual Information (TI). The ITU-T Recommenda-tion P.910 [14] suggests that, to form a set of video samples fora subjective evaluation some video parameters are taken intoaccount, in particular the SI and TI, to avoid that the subjectiveevaluation becomes tiring and boring for the observers. It isimportant that the chosen videos present a variety of valuesof SI and TI.

Fig. 1 indicates the heterogeneity of sources available in theNAMA3DS1-COSPAD1, in which case each point representsa reference stereoscopic video.

0

10

20

30

40

50

60

40 60 80 100 120 140

Tem

pora

l Per

cept

ual I

nfor

mat

ion

(TI)

Spatial Perceptual Information (SI)

Barrier GateBasketBoxers

HallLab

News ReportPhone Call

SoccerTree Branches

Umbrella

Fig. 1: Spatial and temporal diversity found in the videosequences available in NAMA3DS1-COSPAD1.

TABLE I: Types of coding and their parameters available inNAMA3DS1-COSPAD1

Type ParametersQP = 32

Video Coding (H.264) QP = 38QP = 442 Mbits/s

Image Coding (JPG2K) 8 Mbits/s16 Mbits/s32 Mbits/s

V. SIMULATION RESULTS

The performance of the proposed algorithms was measuredusing the following estimators: Pearson Linear CorrelationCoefficient (PLCC), Spearman Rank-Order Correlation Co-efficient (SROCC), Kendall Rank-Order Correlation Coeffi-cient (KROCC) and Root Mean Square Error (RMSE). Theseestimators were used to assess the accuracy, monotonicityand consistency of the objective model prediction with re-spect to human subjective scores available from NAMA3DS1-COSPAD1, allowing the comparison and validation of theperformance of the proposed algorithms with state of the artalgorithms.

The PLCC, SROCC, KROCC and RMSE were computedafter performing a non-linear regression on the objective videoquality assessment algorithmic measures, using a four param-eter monotonic cubic polynomial function to fit the objectiveprediction to the subjective quality scores. The function is thefollowing [15],

DMOS(p)t = β 1 + β 2 ·Qt + β 3 ·Q2

t + β 4 ·Q3t , (16)

in which Qt represents the quality that a video quality assess-ment algorithm predicts for the t-th video in the NAMA3DS1-COSPAD1 Video Quality Database. The non-linear leastsquares optimization is performed using the MATLAB R� func-tion nlinfit to find the optimal parameter β that minimizesthe least squares error between the subjective scores (DMOSt)and the fitted objective scores (DMOS(p)

t ). The MATLAB R�function nlpredci was used to obtain the DMOS predictedscores, after the least squares optimization. The results of thestatistical measures are presented in Table II, the two bestresults are shown in boldface.

The results shown in Table IIa, considering the degradationsintroduced by the H.264 coding, indicate a significant improve-ment in the performance of the objective algorithms when theyare combined with the disparity information contained in theoriginal video.

TABLE II: Performance measures of the objective algorithms.

(a) H.264 scenario.

Algorithm PLCC SROCC KROCC RMSEPSNR 0.774946 0.721424 0.533869 0.689299SSIM 0.730523 0.716222 0.555117 0.744770

PW–SSIM 0.915983 0.906776 0.756978 0.437573DPSNR 0.863640 0.838604 0.640111 0.549789DSSIM 0.901635 0.892266 0.746354 0.471688

DPW–SSIM 0.954403 0.937166 0.815412 0.325572

(b) JPEG2000 scenario.



DPW–SSIM 0.975911 0.971048 0.865991 0.285951

(c) H.264–JPEG2000 joint scenario.



DPW–SSIM 0.967001 0.955609 0.830147 0.312082

The DSSIM metric presents an increase of 23.4% of accu-racy (PLCC), 24.6% and 34.4% of monotonicity (SROCC andKROCC) and a decrease of 36.7% in the RMSE (increase ofconsistency), with respect to the performance of the SSIM.

The performance of the objective algorithms for theJPEG2000 scenario (shown in Table IIb) was similar. The DP-SNR produces the highest improvement, of 10.4% of accuracy,12.3% and 16.3% (SROCC and KROCC) of monotonicity and27.6% of consistency, with respect to the performance of thePSNR.

The results for the H.264–JPEG2000 joint scenario arepresented in Table IIc. The DPW–SSIM achieved the bestperformance, for all the scenarios, since this algorithm usesthe perceptual combined with the disparity weighting.

Table III summarizes the percentage of improvement of theobjective algorithms for the proposed technique in relation tothe classical approaches.

Fig. 2 contains the PLCC with a 95% confidence interval forthe H.264, JPEG2000 and H.264–JPEG2000 joint scenarios.The Z Fisher’s transformation was applied to the PLCC (ρ) toproduce the confidence interval. The Z Fisher’s transformationis defined as

Z =1

2loge

�1 + ρ

1− ρ

�, (17)

in which Z follows the Normal distribution with variance

σ2Z =

1

Ns − 3, (18)

and Ns is the total number of samples. The confidence intervalfor this variable is defined as,

IC(z, 1− α) = (Z − z1−α · σZ , Z + z1−α · σZ). (19)

TABLE III: Increase in performance of the objective algo-rithms when the proposed technique was included.

(a) H.264 scenario.

Comparison PLCC SROCC KROCC RMSEDPSNR – PSNR 11.4% 16.2% 19.9% 20.2%DSSIM – SSIM 23.4% 24.6% 34.4% 36.7%

DPW-SSIM – PW-SSIM 4.2% 3.4% 7.7% 25.6%



DPW-SSIM – PW-SSIM 0.4% 0.5% 0.6% 6.4%



DPW-SSIM – PW-SSIM 1.6% 1.3% 3.6% 16.8%

For α = 0.05, i.e., an interval with 95% of confidence, z0.95 =1.96, and Equation 19 is rewritten as

IC(z, 0.95) = (Z − 1.96 · σZ , Z + 1.96 · σZ). (20)

The inverse of the Z Fisher’s transformation is,

ρ =e2z − 1

e2z + 1, (21)

and the confidence interval in terms of ρ is defined as,

IC(ρ, 0.95) =�e2·(Z−1.96·σZ) − 1

e2·(Z−1.96·σZ) + 1,e2·(Z+1.96·σZ) − 1

e2·(Z+1.96·σZ) + 1

�.

(22)Fig. 2 suggests that, besides an increase in the correlation

coefficient, there was a reduction in the length of the PLCCconfidence interval for the algorithms that use the disparityweighting technique.

Fig. 3 presents the trend between the set of subjective scoresand the set of objective results. The observation of the scatterplots indicates a lower dispersion around of the predictioncurve for the algorithms that use the disparity weighting.

VI. CONCLUSIONS

This paper presents initial contributions for the objectivestereoscopic video quality assessment. The proposed algo-rithms use a disparity weighting strategy to take into accountthe stereoscopic information, to improve the evaluation. A reli-able assessment is important for systems and services that usevideo processing techniques. The simulation results suggest asignificant increase on the evaluation capacity of the objectivealgorithms, that uses the proposed technique, in relation tothe classical algorithms. For future work, the authors intendto consider temporal information in the objective evaluationand compare with recent published algorithms for stereoscopicvideo quality assessment.

0.5

0.6

0.7

0.8

0.9

1

PSNR SSIM PWSSIM DPSNR DSSIM DPWSSIM

PLC

C

Objective Algorithms

(a) H.264 scenario.

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1


PLC

C



0.55 0.6

0.65 0.7

0.75 0.8

0.85 0.9

0.95 1


PLC

C



Fig. 2: The 95% confidence intervals for the PLCC.

20 25 30 35 40 45−1

0

1

2

3

4

5

PSNR (dB)

DM

OS

0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

SSIM

DM

OS

0.5 0.6 0.7 0.8 0.9 1−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

PWSSIM

DM

OS

15 20 25 30 35 40−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

DPSNR (dB)

DM

OS

0.5 0.6 0.7 0.8 0.9 1−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

DSSIM

DM

OS

0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

DPWSSIM

DM

OS

Fig. 3: Scatter plots of subjective scores (DMOS) versus model prediction. Each sample point represents a 3D test videosample.

ACKNOWLEDGMENT

This research was supported by the NationalCouncil for Scientific and Technological Development(CNPq), by the Federal University of Campina Grande(UFCG/COPELE/PIBIC), and by the Institute for AdvancedStudies in Communications (Iecom).

REFERENCES

[1] A. Benoit, P. L. Callet, P. Campisi, and R. Cousseau, “Quality assess-ment of stereoscopic images,” EURASIP J. Image and Video Processing,vol. 2008, 2008.

[2] M. Corbetta, “Frontoparietal cortical networks for directing attentionand the eye to visual locations: identical, independent, or overlappingneural systems?” Proc Natl Acad Sci U S A, vol. 95, no. 3, pp.831–8, 1998. [Online]. Available: http://www.biomedsearch.com/nih/Frontoparietal-cortical-networks-directing-attention/9448248.html

[3] L. Itti and C. Koch, “Computational modelling of visual attention,”Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, Mar 2001.

[4] C. D. M. Regis, J. V. M. Cardoso, and M. S. Alencar, “Effect ofvisual attention areas on the objective video quality assessment,” inProceedings of the 18th Brazilian symposium on Multimedia and theweb, ser. WebMedia ’12. New York, NY, USA: ACM, 2012, pp. 75–78.[Online]. Available: http://doi.acm.org/10.1145/2382636.2382654

[5] C. D. M. Regis, J. V. M. Cardoso, I. P. Oliveira, and M. S. Alencar,“Performance of the objective video quality metrics with perceptualweighting considering first and second order differential operators,” inProceedings of the 18th Brazilian symposium on Multimedia and theweb, ser. WebMedia ’12. New York, NY, USA: ACM, 2012, pp. 71–74.[Online]. Available: http://doi.acm.org/10.1145/2382636.2382653

[6] W. Y. L. Akamine and M. C. Q. Farias, “Incorporating visual attentionmodels into image quality metrics,” in Proceedings of the Sixth Interna-tional Workshop on Video Processing and Quality Metrics for ConsumerElectronics (VPQM), 2012.

[7] H. Liu and I. Heynderickx, “Studying the added value of visual attentionin objective image quality metrics based on eye movement data,” in16th IEEE International Conference on Image Processing, nov. 2009,pp. 3097 –3100.

[8] Z. Wang and Q. Li, “Information content weighting for perceptual imagequality assessment,” Image Processing, IEEE Transactions on, vol. 20,no. 5, pp. 1185–1198, 2011.

[9] M. Urvoy, M. Barkowsky, R. Cousseau, Y. Koudota, V. Ricorde,P. Le Callet, J. Gutierrez, and N. Garcia, “Nama3ds1-cospad1: Subjec-tive video quality assessment database on coding conditions introducingfreely available high quality 3d stereoscopic sequences,” in Quality ofMultimedia Experience (QoMEX), 2012 Fourth International Workshopon, july 2012, pp. 109 –114.

[10] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assess-ment: from error visibility to structural similarity,” IEEE Transactionson Image Processing, vol. 13, no. 4, pp. 600 –612, april 2004.

[11] C. D. M. Regis, J. V. M. Cardoso, and M. S. Alencar, “Video qualityassessment based on the effect of the estimation of the spatial perceptual

information,” in Proceedings of 30th Brazilian Symposium of Telecom-munications (SBrT’12), 2012.

[12] IRCCyN-IVC, “Nantes-madrid 3d stereoscopic database,”http://www.irccyn.ec-nantes.fr/spip.php?article1052,Julho 2012.

[13] International Telecommunication Union, “Recommendation BT.500-11:Methodology for the subjective assessment of the quality of televisionpictures,” ITU-T, Tech. Rep., 2002.

[14] ——, “Recommendation P.910: Subjective video quality assessmentmethods for multimedia applications,” ITU-T, Tech. Rep., April 2008.

[15] ——, “Objective perceptual assessment of video quality: Full referencetelevision,” ITU-T, Tech. Rep., 2004.

[ieee 2013 ieee international symposium on broadband multimedia systems and broadcasting (bmsb) -...

Documents