
Perceptual Video Quality Assessment Based on Salient Region Detection

Cristina Oprea, Ionut Pirnog, Constantin Paleologu, and Mihnea Udrea
Dept. of Telecommunications, Politehnica University of Bucharest

[email protected], [email protected], [email protected], [email protected]

Abstract

Video-based applications and services usually require at some stage a reliable video quality evaluation method that can estimate the human-perceived video quality. While most research is performed in the area of human visual system modeling, we propose a quality metric which first estimates the perceptually important areas using the key elements that attract attention: color contrast, object size, orientation, and eccentricity. The visual attention model implemented here performs as a bottom-up attentional mechanism. For the detected salient areas, a distortion measure is then computed using a specialized no-reference metric. We propose an embedded reference-free video quality metric and show that it outperforms the standard peak signal-to-noise ratio (PSNR) in evaluating perceived video quality. The results are also shown to correlate with the subjective results obtained for several test sequences.

1. Introduction

Objective assessment of video perceptual quality has registered several distinct approaches: quality metrics based on human visual system models, specialized quality metrics, and metrics based on models of the human visual attention system. Early human visual system modeling has proven to be a difficult task due to its physiological complexity. In addition, this approach requires that both sequences, the reference and the distorted one, be present at the quality evaluation stage. Hence, such methods are usually confined to testing algorithms or video coding design verification, and are not suitable for real-time applications over typical networks. The second approach is distortion-based: the specialized metric looks for a specific artifact in the video sequence and evaluates the level of annoyance introduced by that distortion. This approach does not require the presence of the reference video sequence at the moment the quality assessment takes place. The last approach introduces a visual attention model in order to identify perceptually significant areas in a video frame and then evaluates the perceptual video quality only on those regions. The quality metric used in this case can be a specialized (no-reference) metric or the Structural Similarity index (SSIM) [1]. This approach has the advantage of processing less information, since it considers only the regions found to be perceptually important, thus making quality evaluation faster.

This article focuses on the last approach described above and proposes a quality evaluation method based on a visual attention model. Previous work in the field of visual attention modeling can be classified into three general categories: pixel-based methods, frequency-space methods, and region-based algorithms [2].

Although the visual assessment task seems simple in humans, it actually involves a collection of very complex mechanisms that are not completely understood. The visual attention process can be reduced to two physiological mechanisms that, combined, produce the usual selection of perceptually significant areas from a natural or artificial scene: bottom-up attentional selection and top-down attentional selection. The first mechanism is an automated selection performed very quickly, driven by the visual stimulus itself. The second originates in the higher cognitive areas of the brain and is driven by individual preferences and interests. A complete simulation of both mechanisms would result in a tremendously complex and time-consuming algorithm. Our work proposes a model for the bottom-up attentional selection mechanism and shows that it offers a good estimate of the perceptually important areas.

The paper is structured as follows. Section 2 draws a general picture of previous work on computational models of the human visual attention system. Section 3 contains a detailed presentation of the proposed algorithm, while the last two sections summarize the experimental results, the conclusions, and directions for future work.


2. Previous work

The usual approach for finding the focus of attention in a scene is to build feature maps for that scene, following the feature integration theory developed by Treisman [3]. This theory states that distinct features in a scene are automatically registered by the visual system and coded in parallel channels, before the items in the image are actually identified by the observer. Independent features like orientation, color, spatial frequency, brightness, and motion direction are put together in order to construct a single object that is in the focus of attention. Pixel-based, spatial frequency, and region-based models of visual attention are different methods of building feature maps and extracting saliency.

The pixel-based category is represented by Laurent Itti's work on emulating bottom-up and top-down attentional mechanisms [4]. The first computational stage consists of separating the visual information into chromatic opponent channels of blue-yellow and red-green, along with an achromatic white-black channel. For the intensity component, six maps are computed as absolute differences between the intensity of the current pixel and its surround at six different resolution scales. The chromatic channels are normalized by the previously determined intensity channel, and double-opponency is then computed by center-surround differences across the resolution scales. The model implements an iterative lateral inhibition scheme for each feature map and finally integrates all maps into a single saliency map.
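As an illustration of the center-surround idea, the Python sketch below approximates it with Gaussian blurs at a fine (center) and a coarse (surround) scale instead of a true dyadic pyramid; the function name, scale values, and surround ratio are our own choices, not taken from [4].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_maps(intensity, center_sigmas=(1, 2, 4), surround_ratio=4):
    """Approximate Itti-style center-surround intensity maps.

    The center is the image smoothed at a fine scale, the surround the
    same image smoothed at a coarser scale; their absolute difference
    highlights locally contrasting (potentially salient) structure.
    """
    maps = []
    for sigma in center_sigmas:
        center = gaussian_filter(intensity.astype(float), sigma)
        surround = gaussian_filter(intensity.astype(float), sigma * surround_ratio)
        maps.append(np.abs(center - surround))
    return maps
```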

Another possibility for building feature maps is to apply different filtering operations in the frequency domain. The most common such filtering uses Gabor filters to extract orientation from the visual input [4]. Difference-of-Gaussians filters are also used for center-surround contrast detection. The model in [5] applies the opponent color theory and uses contrast sensitivity functions for high-contrast detection.

The last category of visual attention models comprises the region-based algorithms. In this case, a clustering operation such as region segmentation is usually performed on the original image, and feature maps are then computed using these clusters. This approach leads to a combination of spatial frequency processing and pixel-based operations [6]. For this type of visual attention modeling it is easy to take saliency with respect to size into consideration, size being an important factor in natural visual selection. Discriminating between large and small objects in a given scene has the advantage of eliminating large background areas and small unnoticeable patches from the final saliency map.

3. Proposed algorithm

The perceptual video quality evaluation method proposed here begins with a selection of the perceptually significant areas from the given frame. The visual attention model used here is a modified version of the algorithm described in [2], the main modifications being performed in order to obtain reasonable parameters for the otherwise time- and resource-consuming processing method. Simulations on low and medium bit-rate video sequences show promising results for the algorithm implemented with the modified visual attention model from [2].

The modifications made to the original region-based visual attention model concern the following processing stages: scene segmentation, intermediary feature map computation, and the feature map integration method for the final saliency map.

3.1. Scene segmentation

Extracting saliency from video sequences is a complex task because it should take into consideration both the spatial extent and the dynamic evolution of regions. The first stage consists in segmenting each frame into regions of different colors. It is not necessary to identify objects or to select regions corresponding to a single object, so the segmentation is performed as quickly and simply as possible. The aim is to obtain areas containing one color or similar colors as individual regions; a specific area can represent an entire object or just a part of one. The regions identified in this step are then passed on to the next stage for parameter computation.

Each frame is processed independently and converted from true color to an indexed representation with an attached color map. This is done because the number of colors in a typical frame is too large to be processed completely without computational overhead. The number of colors in the color map varies according to the chromatic dynamics of the video sequence.

After the number of colors present in the current frame has been minimized, each group of neighbouring pixels having the same chromaticity is considered an individual region.
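A minimal sketch of this segmentation stage, assuming Pillow's quantize for the true-color-to-indexed conversion and SciPy's connected-component labeling; the function name and the default color count are illustrative, not the paper's.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import label

def segment_frame(frame_rgb, n_colors=32):
    """Quantize a frame to an indexed color map, then treat each spatially
    connected group of pixels sharing a palette entry as one region."""
    indexed = np.asarray(Image.fromarray(frame_rgb).quantize(colors=n_colors))
    regions = []
    for color in np.unique(indexed):
        labeled, count = label(indexed == color)   # connected components
        for k in range(1, count + 1):
            regions.append(np.argwhere(labeled == k))  # (row, col) coordinates
    return regions
```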


3.2. Region parameter computation

The region-based model takes as input a set of regions, $R$, each element represented as $R_i$ $(i \in \{1 \dots n\})$. Each $R_i$ contains data regarding location, bounding rectangle, values of color components, and a list of pointers to the immediate neighbours of this region in the same list, denoted by $\eta_i$. Each algorithm discussed in the following subsections adds further information to the members of $R$.
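For concreteness, a hypothetical container for one entry of $R$, mirroring the fields listed above; the field names are ours, not from the paper or [2].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    """One entry R_i of the region list R (illustrative field names)."""
    pixels: List[Tuple[int, int]]            # pixel coordinates of the region
    bbox: Tuple[int, int, int, int]          # bounding rectangle (x0, y0, x1, y1)
    hue: float                               # color components of the region
    saturation: float
    intensity: float
    neighbors: List["Region"] = field(default_factory=list)  # the list eta_i
    perimeter: float = 0.0                   # filled in by later stages
```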

The first region parameter is the region perimeter factor, used as a scaling factor for the saliency values in order to eliminate small unnoticeable patches and large background areas. This factor is computed according to [2] from the region perimeter relative to the image perimeter,

$$f_{P_i} = \frac{p_{\max} P_I - P_{R_i}}{\left(p_{\max} - p_{\min}\right) P_I}$$

and is then clipped to the unit interval:

$$f_{P_i} = \begin{cases} 1, & \text{for } f_{P_i} > 1 \\ 0, & \text{for } f_{P_i} < 0 \\ f_{P_i}, & \text{otherwise} \end{cases}$$

where $P_I$ represents the image perimeter, $P_{R_i}$ is the current region perimeter, and $p_{\max}$ and $p_{\min}$ are percentages of the entire image perimeter used for detecting large and small regions.
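A sketch of this scaling factor under the reconstruction above. Note that only the clamp to [0, 1] is unambiguous in the source; the linear ramp and the default percentages below are assumptions.

```python
def perimeter_factor(region_perimeter, image_perimeter, p_min=0.01, p_max=0.4):
    """Scale factor suppressing very small and very large regions.

    The ramp is one plausible reading of the formula in [2]; p_min and
    p_max are assumed example percentages of the image perimeter.
    """
    f = (p_max * image_perimeter - region_perimeter) / (
        (p_max - p_min) * image_perimeter)
    return min(1.0, max(0.0, f))  # clamp to [0, 1] as in the paper
```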

The following parameters are the intensity factor $f_{\mathcal{I}_{ij}}$ and the saturation factor $f_{\mathcal{S}_{ij}}$, computed for the current region $R_i$ taking into consideration the parameters of the neighbour region $R_j$:

$$f_{\mathcal{S}_{ij}} = \frac{\mathcal{S}(R_i) + \mathcal{S}(R_j)}{2\left(2^{\mathcal{B}} - 1\right)}$$

$$f_{\mathcal{I}_{ij}} = \frac{\mathcal{I}(R_i) + \mathcal{I}(R_j)}{2\left(2^{\mathcal{B}} - 1\right)}$$

where $\mathcal{S}(R_i)$ represents the saturation of the current region color, $\mathcal{I}(R_i)$ the corresponding intensity, and $\mathcal{B}$ stands for the number of bits per sample.
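A direct transcription of the two factors as reconstructed above; the helper name is ours, and `bits` plays the role of $\mathcal{B}$.

```python
def pair_factor(component_i, component_j, bits=8):
    """Normalized average of a color component (saturation or intensity)
    over a region and one of its neighbours; 2**bits - 1 is the maximum
    sample value, so the result lies in [0, 1].
    Usage: f_S = pair_factor(S_i, S_j); f_I = pair_factor(I_i, I_j).
    """
    return (component_i + component_j) / (2 * (2 ** bits - 1))
```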

3.3. Saliency map

The final saliency map is computed taking into consideration several conditions that generate saliency, determining the observer's attention to focus on a specific area. An important saliency-generating aspect of a scene is color contrast, which is defined differently from intensity contrast. Two colors situated on opposite sides of the hue color wheel generate contrast when placed next to one another. The first feature map is then built by evaluating the color contrast situations present in the current frame.

There are five specific situations evaluated for color contrast, each contribution being weighted before constructing the color feature map. For a current region $R_i$, each region having an opposite hue contributes a score to $R_i$, as do regions having a sufficiently distant hue. The third situation is the contrast between warm and cool colors, the warm colors being the attention attractors. The last two situations taken into consideration are the saturation contrast, coming from regions whose colors have completely different saturations, and the usual intensity contrast:

$$M_{C_i} = \sum_{j=1}^{n_i} f_{P_i} \cdot f_{\mathcal{S}_{ij}} \cdot f_{\mathcal{I}_{ij}} \left( 1 + \frac{\Delta\mathcal{S}'_{ij} + \Delta\mathcal{H}_{ij} + \Delta\mathcal{I}_{ij} + \Delta\mathcal{S}_{ij}}{2^{\mathcal{B}} - 1} \right), \quad \forall R_j \in \eta_i$$

A size feature map is implicitly included in the color contrast map due to the presence of the perimeter factor.

The following feature map is generated from orientation and eccentricity, detecting the perceptual saliency of objects having a particular orientation and shape. We use a traditional moment-based technique for finding the orientation and eccentricity of regions; the feature values are later used to determine saliency with respect to these features. Three types of discrete 2-D moments are computed for each $R_i$ according to [2], as follows:

$$\mu_{1,1}^{i} = \sum_{(x,y) \in R_i} (x - \bar{x})(y - \bar{y})$$

$$\mu_{2,0}^{i} = \sum_{(x,y) \in R_i} (x - \bar{x})^2$$

$$\mu_{0,2}^{i} = \sum_{(x,y) \in R_i} (y - \bar{y})^2$$

where the pair $(\bar{x}, \bar{y})$ is the center of the current region. The orientation and the eccentricity of a region are then computed as:

$$\phi_i = \frac{1}{2} \tan^{-1} \frac{2\,\mu_{1,1}^{i}}{\mu_{2,0}^{i} - \mu_{0,2}^{i}}$$

$$e_i = \frac{\left(\mu_{2,0}^{i} - \mu_{0,2}^{i}\right)^2 + 4\left(\mu_{1,1}^{i}\right)^2}{\left(\mu_{2,0}^{i} + \mu_{0,2}^{i}\right)^2}$$
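The moment computations above translate directly into code; the sketch below (function name ours) uses `arctan2` as a numerically safe form of the arctangent, and assumes the region has more than one pixel.

```python
import numpy as np

def orientation_and_eccentricity(pixels):
    """Central moments of a region's pixel coordinates, then the standard
    moment-based orientation and eccentricity used in the paper."""
    pts = np.asarray(pixels, dtype=float)           # rows of (x, y)
    x = pts[:, 0] - pts[:, 0].mean()                # centered coordinates
    y = pts[:, 1] - pts[:, 1].mean()
    mu11, mu20, mu02 = (x * y).sum(), (x ** 2).sum(), (y ** 2).sum()
    phi = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)   # orientation angle
    ecc = ((mu20 - mu02) ** 2 + 4 * mu11 ** 2) / (mu20 + mu02) ** 2
    return phi, ecc
```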

resulting in the saliency map due to orientation and eccentricity:

$$M_{OE_i} = \sum_{j=1}^{n_i} \Omega_{\phi_{ij}} + e_i \sum_{j=1}^{n_i} \Omega_{e_{ij}}$$

where:

$$\Omega_{\phi_{ij}} = \begin{cases} \dfrac{\Delta\phi_{ij}}{90}, & \text{for } \Delta\phi_{ij} > \phi_T,\ |\eta_i| < n_T \\[4pt] \omega_1 \dfrac{\Delta\phi_{ij}}{90}, & \text{for } \Delta\phi_{ij} > \phi_T,\ |\eta_i| \geq n_T \\[4pt] 0, & \text{otherwise} \end{cases}$$

$$\Omega_{e_{ij}} = \begin{cases} \omega_1, & \text{for } \Delta e_{ij} < e_T,\ |\eta_i| < n_T \\ \omega_2, & \text{otherwise} \end{cases}$$

with $\Delta\phi_{ij}$ and $\Delta e_{ij}$ the orientation and eccentricity differences between region $R_i$ and its neighbour $R_j$, $\phi_T$ and $e_T$ the corresponding thresholds, $n_T$ a threshold on the number of neighbours, and $\omega_1$, $\omega_2$ weighting constants.

3.4. Quality assessment

Our metric for video quality assessment is a simple blurriness metric, chosen for its computational simplicity and its speed. This quality evaluation method is used only for the regions previously detected as perceptually significant. The technique for measuring blurriness is based on the assumption that the most significant edges in an image, which often represent borders of objects, are sharp [7]. Compression has a smearing effect on these edges, the extent of which our blurriness metric attempts to measure.

The algorithm is summarized as follows (a sketch is given after the list):

• First, an edge detector (e.g. a Sobel filter) is applied to the luminance component of the image.

• Thresholding the edge gradients removes noise and insignificant edges.

• The start and end positions of each significant edge in the image are defined as the locations of the local intensity minimum and maximum closest to the edge.

• The distance between these two points is identified as the local blur measure for this edge location.

The global blurriness for the whole image is estimated by averaging the local blur values over all significant edges found.
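A minimal sketch of this edge-width measure, scanning each row for horizontal gradients. The Sobel choice matches the example in the list, but the function name and the threshold value are assumptions; the paper fixes only the overall procedure.

```python
import numpy as np
from scipy.ndimage import sobel

def blurriness(luma, grad_thresh=40):
    """Average edge width over significant edges: for each thresholded
    gradient location, walk left and right to the nearest local intensity
    extremum and record the distance between them."""
    grad = sobel(luma.astype(float), axis=1)   # horizontal gradient
    widths = []
    for r in range(luma.shape[0]):
        row = luma[r].astype(float)
        for c in np.where(np.abs(grad[r]) > grad_thresh)[0]:
            s = np.sign(grad[r, c])
            left = c
            while left > 0 and (row[left - 1] - row[left]) * s < 0:
                left -= 1                      # descend toward the local minimum
            right = c
            while right < len(row) - 1 and (row[right + 1] - row[right]) * s > 0:
                right += 1                     # climb toward the local maximum
            widths.append(right - left)        # local blur measure for this edge
    return float(np.mean(widths)) if widths else 0.0
```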

Table 1. Correlation with subjective scores, with and without significant area selection, and prediction error

                        Without significant   With significant   Prediction
                        area selection        area selection     error
PSNR                    90%                   88%                10.1
NR quality metric       91%                   92%                9.7

4. Experimental results

The following figures present a selection of the resulting saliency maps for different types of images. The algorithm is shown to give good estimates for the salient regions in the tested images.

Here we summarize the results of an evaluation of the prediction performance of the NR blurriness metric as a predictor of overall perceived image quality. The test images were created by compressing 29 color images (typically of size 768×512 pixels) with a JPEG2000 encoder. Compression ratios range from 7.5 to 800, for a total of 169 compressed images. The subjective experiments were conducted in two separate sessions with 9 and 15 observers, respectively; the original uncompressed images were included in both. Observers provided their quality ratings on a continuous scale from 1 (lowest quality) to 100 (highest quality).

As shown in Table 1, PSNR is already an excellent predictor of perceived quality for this database: the correlation with the mean opinion score (MOS) is about 91%. These good results can be attributed largely to the fact that the database contains exclusively images created with a single type of encoder (JPEG2000), and thus mainly varying degrees of the same distortions. The hypothesis behind using the NR blurriness metric is that, for this dataset, the quality prediction is a simple non-linear transform of the measured blur. To test this, we separated the test images into a training set and a test set, using 100 different random divisions of the dataset. As shown in Table 1, our metric achieves correlations of around 85% with MOS on the test sets, which is a good prediction performance for an NR metric.
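The evaluation protocol can be sketched as follows, assuming a simple logarithmic fit on the training set as the non-linear transform; the fit family, split fraction, and function name are our assumptions, and blur scores are assumed strictly positive.

```python
import numpy as np
from scipy.stats import pearsonr

def split_correlation(blur_scores, mos, n_splits=100, train_frac=0.5, seed=0):
    """Mean Pearson correlation between transformed blur scores and MOS
    over repeated random train/test divisions of the image set."""
    blur = np.asarray(blur_scores, dtype=float)
    mos = np.asarray(mos, dtype=float)
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(blur))
        cut = int(len(blur) * train_frac)
        train, test = idx[:cut], idx[cut:]
        a, b = np.polyfit(np.log(blur[train]), mos[train], 1)  # log fit on train
        r, _ = pearsonr(a * np.log(blur[test]) + b, mos[test]) # score on test
        corrs.append(r)
    return float(np.mean(corrs))
```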

Figure 1. On the left side, the original images and their corresponding saliency maps (brighter gray represents regions with higher significance).

The ITU has established standard test procedures for subjective video quality evaluation [8]. Following those recommendations, we performed several subjective evaluations with observer groups on a collection of 78 short video sequences, obtaining difference mean opinion scores (DMOS) for each sequence. The scatter plot in Figure 4 shows that this quality evaluation method leads to a set of very good results, since the characteristic follows a logarithmic curve.


Figure 2. Top left: original urban image; top right: color contrast map; bottom left: orientation map; bottom right: final saliency map.

Figure 3. Top left: original Lena image; top right: color contrast map; bottom left: orientation map; bottom right: final saliency map.

5. Conclusions and Future Work

The quality metric proposed in this paper first estimates the perceptually important areas using the key elements that attract attention: color contrast, size, orientation, and eccentricity. For these areas, a distortion measure is then computed and the significant results are temporarily stored. Simulation results indicate that the proposed video quality metric clearly outperforms standard PSNR in estimating the quality of a video.

This algorithm has limitations, some of which stem from the fact that each frame is processed individually.

Figure 4. Scatter plot: objective algorithm scores plotted against DMOS (difference mean opinion scores); plot legend: "VQM DCT; white noise".

Consecutive frames in a video sequence are usually correlated, so salient regions in a given frame can be estimated from the regions previously detected in earlier frames. In future work we will study methods to estimate saliency in a given frame taking into consideration the results stored from previous frames. This approach can also be related to a model of human visual short-term memory.

6. References

[1] Bovik, A. C., Handbook of Image & Video Processing, 2nd ed., Academic Press, 2006.

[2] Aziz, M. Z., and Mertsching, B., "Fast and robust generation of feature maps for region-based visual attention," IEEE Trans. Image Process., vol. 17, no. 5, pp. 633–644, 2008.

[3] Treisman, A. M., and Gelade, G., "A feature-integration theory of attention," Cogn. Psychol., vol. 12, pp. 97–136, 1980.

[4] Itti, L., "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE Trans. Image Process., vol. 13, no. 10, pp. 1304–1318, 2004.

[5] Le Meur, O., Le Callet, P., Barba, D., and Thoreau, D., "A coherent computational approach to model bottom-up visual attention," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 5, pp. 802–817, 2006.

[6] Lu, Z., Lin, W., Yang, X., Ong, E., and Yao, S., "Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation," IEEE Trans. Image Process., vol. 14, no. 11, pp. 1928–1942, 2005.

[7] Rapantzikos, K., Tsapatsoulis, N., Avrithis, Y., and Kollias, S., "Bottom-up spatiotemporal visual attention model for video analysis," IET Image Process., vol. 1, no. 2, pp. 237–248, 2007.

[8] ITU-R Recommendation BT.500-11, "Methodology for the subjective assessment of the quality of television pictures," International Telecommunication Union, Geneva, Switzerland, 2002.
