
Emotional Valence Recognition, Analysis of Salience and Eye Movements

Hamed R.-Tavakoli*, Victoria Yanulevskaya†, Esa Rahtu*, Janne Heikkilä*, Nicu Sebe†

* Center for Machine Vision Research, University of Oulu, Finland
† DISI, University of Trento, Italy
{hamed.rezazadegan,esa.rahtu,jth}@ee.oulu.fi  {yanulevskaya,sebe}@disi.unitn.it

Abstract—This paper studies the performance of recorded eye movements and computational visual attention models (i.e., saliency models) in the recognition of the emotional valence of an image. In the first part of this study, it employs eye movement data (fixations and saccades) to build image content descriptors and uses them with support vector machines to classify the emotional valence. In the second part, it examines whether the human saliency map can be substituted with state-of-the-art computational visual attention models in the task of valence recognition.

The results indicate that the eye movement based descriptors provide significantly better performance than the baselines, which apply low-level visual cues (e.g., color, texture, and shape). Furthermore, it will be shown that the current computational models for visual attention are not able to capture the emotional information to the same extent as real eye movements.

Keywords—eye movements; emotion; valence; saliency;

I. INTRODUCTION

The emotional impression of a scene can have a large impact on many applications such as advertisement, image search, human computer interaction, and web page design. It is usually measured in terms of valence, which refers to the extent to which an individual is attracted or repelled by the visual content. In this context, a scene is pleasant, neutral, or unpleasant. Fig. 1 depicts some examples of emotional stimuli.

The realization of such emotions is influenced by the cognitive state of mind, which is in turn reflected in human biosignals, e.g., the eye movement pattern variations in the famous Yarbus experiments. There are also studies demonstrating that emotional content affects the patterns of eye movements (e.g., [1], [2], [3]). Motivated by these findings, it is interesting to investigate the applicability of human eye movements to the recognition of emotional valence and to assess saliency models as a means to replicate them.

Traditionally, valence recognition/classification relies on low-level image descriptors (e.g., [4], [5], [6], [7]) or psychological models and art theories (e.g., [8], [9], [10]). In contrast, this paper elaborates a framework that utilizes features extracted from the statistics of eye movements. Although, akin to the aforementioned methods, this paper uses conventional photographs, its methodology resembles a biosignal crowdsourcing scenario, in which information extracted from the eye movements of several observers is fused to judge the emotional content of a scene.

Fig. 1. Example images that elicit different emotions. From left to right: unpleasant, neutral, and pleasant images from [11].

Such a scenario may become common with the growing popularity of personalized wearable biosignal recorders.

This paper studies which characteristics of eye movements are the most informative and proposes a complete framework for classifying the emotional valence. The main contributions of the paper are as follows:

1) It investigates the role of eye movement data in valence recognition. This is achieved by extracting a wide variety of features from eye movement data and building image content descriptors using these features.

2) It demonstrates that the proposed eye movement based descriptors significantly outperform the previously applied low-level cues in emotional valence classification.

3) It assesses the applicability of computational visual attention models as substitutes for the recorded eye movements. It will be shown that computational models fall behind human performance in dealing with emotional stimuli.

II. RELATED WORK

In the literature, many works on emotional valence classification have focused on the design of descriptors. The most explored features are color-based [8], [12], [5], [13], [9], [10], [6], [7]. Initially, for art images, [8] exploited the influence of color regions and color transitions on the emotional state of an observer. The idea is rooted in Itten's art theory [14], where the red color has a positive emotional impact and is often associated with happiness, dynamism, and power, while purple is referred to as a melancholy color with negative valence. In particular, Colombo and colleagues [8] considered the following features: hue, brightness, saturation, position, and size of each

homogeneous region in the image, together with harmony, warmth, and luminance. Eventually, based on these color features, each region is associated with semantic terms which are used to index images in an art image retrieval system. Wang and Yu [10] use a similar approach to consider emotions evoked by art works; they utilize machine learning techniques and fuzzy logic.

Later, Solli and Lenz [9] moved towards natural images. In their work, colors are interpreted based on the psychological study of [15], which investigates the relationship between emotions and color preferences. A color-based bags-of-emotions representation is proposed: a histogram of the number of occurrences of particular color patterns in an image. Such an image representation can be used to identify images of similar emotional impact. In more recent studies [12], [6], [7], the link between image colors and evoked emotions is learned directly from labeled images using machine learning techniques.

Texture and shape are other popular low-level features in the emotion recognition task [4]. The motivation lies in the fact that people perceive curved visual objects as positive [16], while angular or diagonal patterns evoke negative emotions [17]. Arnheim's gestalt theory [18] reflects that humans visually prefer simplicity; therefore, smooth lines are generally associated with positive emotions, while chaotic texture is considered negative. Many of the recent studies combine texture and color together [5], [6], [7], [19], [20].

Extending the feature set, Machajdik and Hanbury [5] combine low-level features with content related features such as the number of frontal faces, the relative size of the biggest face, and the amount of skin in the image. The authors investigate the efficiency of features in predicting the affective response to an image. The results suggest that images of different kinds, such as artwork, professional photos, and average quality natural images, require different types of features. Finally, it is concluded that semantic content analysis is an essential step in the affective classification of natural images.

In order to interpret the semantic context of an image, an observer should discover salient objects and the relationships between them. Although some progress has been made in the areas of salient object detection [21] and action recognition [22], [23], the automatic interpretation of images is still one of the main challenges in computer vision. Recently, Subramanian et al. [24] proposed to use human eye movements recorded during the observation of an image as a proxy to image semantics. To this end, clusters of fixations are derived and statistics of the transitions between them are computed. These statistics were successfully applied to detect interacting objects and to distinguish highly expressive faces from neutral faces. Also, the recent behavioral studies of [3] show that visual attention is influenced by emotional salience; in other words, the emotive regions in an image attract more attention than more visually salient but emotionally neutral regions.

This paper proposes to move away from low-level image features and to consider features which better reflect image semantics, namely eye movements. Eye movements can be seen as a proxy to visual attention, which is highly contextual. This is one of the first works to investigate how predictive different characteristics of eye movements (e.g., duration, density) are for image valence recognition.

III. METHOD

This section presents a human centered framework for the task of classifying the emotional valence of an image/scene. The system, which is illustrated in Fig. 2, relies on analyzing the recorded eye movements. In the training phase, it learns the classification model using an image corpus that contains manually annotated valence labels and recorded eye movements. In the test phase, the system predicts the valence label based on the eye-gaze statistics. The rest of this section describes the details related to data collection, descriptor construction, and classifier design.

A. Eye movement data

The data consists of an ordered set of fixations (i.e., image locations where the human observer maintained his/her gaze for a reasonable time period) and saccades (i.e., fast movements of the eye). A fixation is defined by a quadruple F = (x, y, t_start, t_end) containing the image location coordinates (x, y) and time stamps indicating when the gaze arrived at (t_start) and left (t_end) this location. A saccade is defined as S = (x_start, y_start, x_end, y_end, t_start, t_end), denoting the locations and time stamps for the start and end of the saccade. Fig. 3 depicts a hypothetical eye movement pattern and illustrates the relationship between the fixations and saccades.
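For concreteness, these records can be represented as simple data structures, as in the following Python sketch (hypothetical types, assuming pixel coordinates and millisecond time stamps; not the authors' implementation):

```python
# Hypothetical record types for the eye movement data described above
# (assumes pixel coordinates and millisecond time stamps; not the authors' code).
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float        # horizontal image coordinate of the fixation
    y: float        # vertical image coordinate of the fixation
    t_start: float  # time the gaze arrived at this location
    t_end: float    # time the gaze left this location

@dataclass
class Saccade:
    x_start: float  # where and when the saccade starts ...
    y_start: float
    x_end: float    # ... and where and when it ends
    y_end: float
    t_start: float
    t_end: float
```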

For the i-th saccade and fixation, denoted by S_i and F_i, one can define the following properties (a computational sketch of these properties is given after the list):

1) fixation location(F_i) = (x_i, y_i), the pixel coordinates of the fixation.

2) fixation duration(F_i) = t_end,i − t_start,i, the time that the gaze stayed at the fixation location.

3) saccade duration(S_i) = t_end,i − t_start,i, the time it takes to move the eye from one fixation location to the next.

4) saccade slope(S_i) = (y_end,i − y_start,i)/(x_end,i − x_start,i), the direction and steepness of the straight line that connects two succeeding fixations (see tan(θ) in Fig. 3).

5) saccade length(S_i) = ‖S_i‖, the length of the straight line that connects two succeeding fixations, where the saccade displacement vector is S_i = [x_end,i − x_start,i, y_end,i − y_start,i] and ‖·‖ denotes the L2-norm.

6) saccade velocity(S_i) = ‖S_i‖/(t_end,i − t_start,i), the rate of fixation location change.

7) saccade orientation(S_i, S_{i−1}) = arccos((S_i · S_{i−1})/(‖S_i‖ ‖S_{i−1}‖)), the angle between two succeeding saccades (see φ in Fig. 3).
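The listed properties follow directly from the records above. The Python sketch below (hypothetical helper functions operating on the Fixation/Saccade records sketched earlier) illustrates properties 2 to 7:

```python
# Hypothetical helpers computing properties 2-7 from the Fixation/Saccade
# records sketched earlier; durations are in the units of the time stamps.
import numpy as np

def fixation_duration(f):
    return f.t_end - f.t_start

def saccade_duration(s):
    return s.t_end - s.t_start

def saccade_vector(s):
    # displacement between the two fixations the saccade connects
    return np.array([s.x_end - s.x_start, s.y_end - s.y_start])

def saccade_slope(s):
    return (s.y_end - s.y_start) / (s.x_end - s.x_start)  # tan(theta) in Fig. 3

def saccade_length(s):
    return float(np.linalg.norm(saccade_vector(s)))       # L2-norm of the displacement

def saccade_velocity(s):
    return saccade_length(s) / saccade_duration(s)

def saccade_orientation(s_i, s_prev):
    # angle phi between two succeeding saccades (Fig. 3)
    v1, v2 = saccade_vector(s_i), saccade_vector(s_prev)
    cos_phi = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.arccos(np.clip(cos_phi, -1.0, 1.0)))
```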

B. Features

The proposed descriptors are constructed using the eye movement properties listed above. The process starts by defining a fixation map, which is a binary image containing 1's at the fixation locations and 0's elsewhere. The fixation locations are collected using all the eye movements recorded for the particular image (i.e., from several observers).

Fig. 2. An overview of the proposed framework for emotional valence classification, consisting of (a) a learning phase and (b) a test phase. In the training phase, a classification model is learnt using an image corpus which contains annotated valence labels and recorded eye movements. In the test phase, the system predicts the valence label utilizing the eye movements.

Fig. 3. A hypothetical eye movement pattern which consists of saccades (rapid movements of the eye) and fixations (locations where the gaze remains relatively still). In the figure, F_i denotes a fixation, S_i a saccade, φ the orientation of S2, and tan(θ) the slope of S2.

The fixation map is converted to a saliency map by convolving it with a Gaussian kernel. The kernel parameter σ is set to 10 pixels, which roughly corresponds to a 2° change in the visual angle¹. Fig. 4 illustrates examples of the fixation and saliency maps.
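The conversion described above amounts to marking the pooled fixation locations in a binary map and blurring it with a Gaussian of σ = 10 pixels. A minimal Python sketch follows (hypothetical code using scipy, not the authors' implementation; the scaling to [0, 1] is an assumption):

```python
# Hypothetical sketch of the fixation map and its Gaussian-blurred saliency map.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_and_saliency_maps(fixations, image_shape, sigma=10.0):
    """fixations: records with .x/.y pooled over all observers of one image."""
    h, w = image_shape
    fixation_map = np.zeros((h, w), dtype=np.float64)
    for f in fixations:
        xi, yi = int(round(f.x)), int(round(f.y))
        if 0 <= yi < h and 0 <= xi < w:
            fixation_map[yi, xi] = 1.0          # binary map: 1 at fixation locations
    saliency_map = gaussian_filter(fixation_map, sigma=sigma)
    if saliency_map.max() > 0:
        saliency_map /= saliency_map.max()      # scale to [0, 1] (assumption)
    return fixation_map, saliency_map
```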

The saliency map contains information about the fixation locations and their density. These properties are further separated by extracting a saliency histogram and the top ten salient locations. The former is formed by making a histogram of the saliency map values at the fixation points; the histogram has 10 bins that are uniformly distributed over the range [0, 1]. The top ten salient locations descriptor contains the coordinates of the 10 strongest local maxima in the saliency map. The local maxima are extracted using the inhibition of return mechanism presented in [25]. This process simulates the transitions of the focus of attention (FOA) from one location to the next, which can be implemented using a "winner-take-all" neural network model. Here, the implementation of [26] is used, where a disk with a radius of 10 pixels serves as the FOA model.
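A possible realization of these two descriptors is sketched below: the saliency histogram samples the map at the fixation points and bins the values into 10 uniform bins over [0, 1], while the top ten salient locations are found with a simple inhibition-of-return loop that suppresses a disk of radius 10 pixels (the assumed FOA model) around each selected maximum. This is hypothetical code loosely following [25], [26], not the authors' implementation:

```python
# Hypothetical sketch of the saliency histogram and the top ten salient
# locations (inhibition of return implemented as zeroing a 10-pixel disk).
import numpy as np

def saliency_histogram(saliency_map, fixations, n_bins=10):
    values = [saliency_map[int(round(f.y)), int(round(f.x))] for f in fixations]
    hist, _ = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)            # normalization is an assumption

def top_salient_locations(saliency_map, k=10, radius=10):
    smap = saliency_map.copy()
    h, w = smap.shape
    yy, xx = np.mgrid[0:h, 0:w]
    coords = []
    for _ in range(k):
        y, x = np.unravel_index(np.argmax(smap), smap.shape)      # current FOA
        coords.extend([x, y])
        smap[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 0.0  # inhibit the FOA disk
    return np.array(coords)                     # (x1, y1, ..., x10, y10)
```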

In addition to the above fixation based features, descriptors are also formulated using the saccade properties.

¹ In the NUSEF dataset, 1° of visual angle at 3 feet translates to 5 pixels on screen, and a fixation is detected if the point-of-gaze remains within 2° of visual angle for at least 100 milliseconds. This choice of σ gives us a compact and accurate representation of fixation density.

They are defined as four histograms: saccade slope (30 bins), saccade length (50 bins), saccade velocity (50 bins), and saccade orientation (36 bins). These are constructed by making a histogram from the values of properties 4, 5, 6, and 7, respectively. For each histogram, the centers of the bins are uniformly distributed between the minimum and maximum values of the corresponding property.

Finally, two descriptors are constructed using the fixation and saccade durations (i.e., properties 2 and 3). In behavioural studies, durations are commonly characterised by their minimum, maximum, median, and mean values [27]. However, these provide only a partial picture of the timing statistics. Instead, this study adopts a histogram representation, which contains the distributions of the saccade and fixation durations over the entire viewing task [28]. These are formed by quantising the range between the minimum and maximum duration into 60 uniformly distributed bins.
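All of these histogram descriptors follow the same recipe: spread the bins uniformly between the observed minimum and maximum of a property and count the pooled values. A minimal sketch with the bin counts stated above (the normalization of the histograms is an assumption, as the paper does not specify it):

```python
# Hypothetical sketch of the histogram descriptors for the saccade properties
# and the fixation/saccade durations, using the bin counts stated in the text.
import numpy as np

BIN_COUNTS = {"saccade_slope": 30, "saccade_length": 50, "saccade_velocity": 50,
              "saccade_orientation": 36, "fixation_duration": 60,
              "saccade_duration": 60}

def property_histogram(values, n_bins):
    # bins spread uniformly between the observed minimum and maximum
    values = np.asarray(values, dtype=np.float64)
    hist, _ = np.histogram(values, bins=n_bins,
                           range=(values.min(), values.max()))
    return hist / max(hist.sum(), 1)            # normalization is an assumption

def eye_movement_descriptors(properties):
    """properties: dict mapping a name in BIN_COUNTS to the values pooled over
    all observers of one image (e.g., computed with the earlier helpers)."""
    return {name: property_histogram(vals, BIN_COUNTS[name])
            for name, vals in properties.items()}
```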

Fig. 4. An illustration of the process for constructing the fixation based descriptors: the fixation map is convolved with a Gaussian to obtain the salience map, which is (i) resized and vectorized to serve as a feature, (ii) sampled at the fixation points and binned into the salience histogram over [0, 1], and (iii) processed with inhibition of return to obtain the top 10 salient locations {x1, y1, ..., x10, y10}.

C. Classification

In all the experiments, this paper applies a 1-vs-rest classification scheme (i.e., there exists one classifier per valence category) and a support vector machine. In particular, it utilizes the fast additive formulation described in [29] with a chi-square kernel. It trains a separate classifier for each feature type, and if combinations are needed, they are formed by averaging the scores over the individual classifiers. This was found to perform better in the experiments than concatenating the features before training the classifier. Altogether, this results in a total of 27 classifiers for 9 features and 3 valence categories.
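The classification stage could be approximated as follows. The sketch uses scikit-learn's explicit additive chi-square feature map followed by a linear SVM, which is in the spirit of, but not identical to, the fast additive formulation of [29]; the per-feature score averaging follows the description above. All names and parameters here are illustrative assumptions:

```python
# Hypothetical sketch of the classification stage: one 1-vs-rest SVM per
# descriptor type with an (approximate) additive chi-square kernel, and final
# scores averaged over the per-feature classifiers.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_per_feature_classifiers(features, labels):
    """features: dict mapping descriptor name -> (n_images, dim) non-negative array."""
    models = {}
    for name, X in features.items():
        clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2),
                            LinearSVC(C=1.0))   # multi-class LinearSVC is 1-vs-rest
        clf.fit(X, labels)                      # labels: the three valence categories
        models[name] = clf
    return models

def predict_valence(models, features):
    # Average the per-class decision scores over the individual classifiers,
    # which the paper found preferable to concatenating the features.
    scores = np.mean([models[name].decision_function(X)
                      for name, X in features.items()], axis=0)
    return np.argmax(scores, axis=1)            # index into the sorted class labels
```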

IV. EXPERIMENTS

A. Dataset

In the following experiments, this paper adopts a subset of the International Affective Picture System (IAPS) [30], which is an emotionally affective image corpus with manually annotated valence labels. To be more specific, it takes all the samples that belong to the "People and daily activity" category and have corresponding eye movements available in the NUSEF dataset [11] (the original IAPS does not include eye movement data). Furthermore, it leaves out the images for which the valence label is strongly disputed by the human annotators of IAPS. In total it considers 95 samples consisting of 47 neutral, 24 pleasant, and 24 unpleasant images².

B. Evaluation criteria

The classification results are assessed using a Leave-Pair-Out Cross-Validation (LPOCV) scheme. The process includes the following steps: 1) take two samples, one positive and one negative, as the test set and use the rest for training; 2) classify the test image pair using the obtained model; 3) repeat steps 1 and 2 until all pairs have been used for testing. The classification results are turned into a precision-recall curve, which is summarized in terms of average precision (AP). For multi-class cases, the mean average precision (mAP) is reported.
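A minimal sketch of this LPOCV protocol for one binary (one valence category vs. the rest) problem is given below; the argument make_classifier is a hypothetical factory that returns a fresh, untrained classifier for each fold:

```python
# Hypothetical sketch of leave-pair-out cross-validation with average
# precision for one binary (one valence category vs. the rest) problem.
import numpy as np
from itertools import product
from sklearn.metrics import average_precision_score

def lpocv_average_precision(X, y, make_classifier):
    """X: descriptors, y: binary labels (1 = target class); make_classifier
    returns a fresh, untrained classifier exposing fit/decision_function."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    scores, truths = [], []
    for i, j in product(pos, neg):              # every positive/negative pair
        test = np.array([i, j])
        train = np.setdiff1d(np.arange(len(y)), test)
        clf = make_classifier()
        clf.fit(X[train], y[train])
        scores.extend(clf.decision_function(X[test]))
        truths.extend(y[test])
    return average_precision_score(truths, scores)
```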

C. Experiment 1: eye movement descriptors

The first set of experiments evaluates the performance of the proposed descriptors individually. The resulting AP values for each valence category and the mAPs over all categories are shown in Fig. 5. One can observe clear performance differences between the valence categories. In most cases, the neutral images seem to be much easier to distinguish than the pleasant or unpleasant classes. Fig. 6 depicts example output of the proposed framework.

As can be observed, there are large differences between the descriptors. For instance, the top 10 salient locations descriptor performs very well on pleasant images, but gives poor results for the unpleasant and neutral categories. This indicates that the measured properties are strongly complementary, and one can expect clear benefits from using them jointly. To summarize the first experiment, the two clearly best performing individual descriptors are the fixation duration and the saliency map.

² Data and code are available at: hrtavakoli.net

Fig. 6. Recognition examples of the proposed framework. The first row depicts the salience maps of correctly recognized images and the bottom row visualizes misclassifications. Left to right: unpleasant, neutral, pleasant. The images, from the top left, are: 6561, 2200, 2530, 6560, 2491, 1340.

TABLE I. COMPARISONS WITH THE BASELINE

Method              unpleasant   pleasant   neutral   mean
Proposed (all)         0.71        0.55       0.81    0.69
Proposed (2 best)      0.73        0.63       0.80    0.72
SIFT                   0.26        0.26       0.53    0.35
LAB                    0.28        0.26       0.51    0.35
SIFT + LAB             0.27        0.26       0.52    0.35

D. Experiment 2: baselines

The second experiment assesses the proposed framework using all features jointly or only the two best performing individual features. The results are compared with the Bag-of-Visual-Words (BoW) baseline system that was used in [6] for predicting the emotional valence of abstract paintings. They applied visual words based on LAB color values and dense SIFT descriptors. The resulting APs and mAPs are listed in Table I.

It can be seen that the proposed eye movement based descriptors perform clearly better than the BoW framework of [6]. This supports the hypothesis that low-level color or texture based features are not enough to characterise contextually rich conventional photographs. However, it seems that recorded eye movements alone may convey enough information to reasonably predict the emotional valence experienced by the observer. In addition, it can be noted that the performance using only the two best individual features is roughly equal to that of the combination of all features.

E. Experiment 3: computational models for visual attention

This experiment investigates whether the recorded eye movements could be replaced by computational models developed for predicting visual attention. This is motivated by recent results (e.g., [31]) where computational models have been shown to reach almost identical performance with humans. Five state-of-the-art models are included in this experiment: Achanta [32], adaptive whitening saliency (AWS) [31], graph based visual saliency (GBVS) [33], context aware saliency (CAS) [34], and Tavakoli [35]. For each model, the software and parameters provided by the original authors are used.

This experiment extracts only the proposed saliency map descriptor, using either the recorded eye movements or one of the computational saliency models. For the computational models, the saliency maps are rescaled and vectorized as described in Section III. The rest of the framework remains the same as in the previous experiments.

Fig. 5. (a-c) show the average precision (AP) values for each valence category (pleasant, unpleasant, neutral) and proposed descriptor type (fixation duration, salience map, top 10 salient locations, salience histogram, saccade duration, saccade length, saccade slope, saccade orientation, saccade velocity). (d) depicts the mean average precision (mAP) for each descriptor over all three categories.

The results, summarized in Fig. 7, indicate that the eye movement based approach clearly outperforms the computational models. The hypothesis is that this happens because the emotional saliency overrides the visual saliency, which is consistent with the findings of [3].

F. Experiment 4: fixation duration

The last experiment investigates the power of the best feature of the proposed method, namely the histogram of fixation durations. To this end, it combines the computational saliency maps considered in Experiment 3 with the recorded humans' fixation durations in the same way as described in Section III. The results are compared with the proposed method using the 2 best features, i.e., the combination of the recorded humans' fixation durations with the saliency map based on the humans' fixation locations.

The results are depicted in Fig. 8. At first glance, as expected, the proposed framework outperforms the combination of artificial salience and fixation duration on average over all categories. However, a careful study of the results on each category gives a better understanding. In the case of pleasant stimuli, the combination of some artificial models with the fixation duration performs better than the proposed framework. For instance, the combination of CAS and fixation duration outperforms all the other features. This may suggest combining artificial models and human eye movement statistics in the application of valence recognition. Thus, the combination of CAS and the proposed framework was evaluated, which revealed that the overall performance would still be 2% behind the proposed framework, though a small gain in the case of pleasant stimuli is achieved. This indicates that a wiser method for combining computational models of visual attention and human eye movements is needed.

Comparing Fig. 7 and Fig. 8 suggests that the fixation duration boosts the performance of the computational models of visual attention. This signifies the importance of temporal information in salience modeling when dealing with emotional stimuli.

V. CONCLUSION

This paper investigated the applicability of recorded eye movement data for the task of emotional valence classification. To this end, the eye movement patterns were utilized to build image content descriptors. It was learned that the fixation information can be a strong cue for inferring the emotional valence of an image. In particular, the fixation duration and the saliency map proved to be excellent features for encoding emotional information.

Additionally, it was examined whether the computational models could replace the recorded eye movements in the classification framework. It was found that none of the tested state-of-the-art models could achieve performance similar to the eye movement based version.

Currently, the main limitation for further development is the lack of a large image dataset containing both emotional valence labels and recorded eye movement data. In future work, it intends to develop such a database, which could also be very useful for developing computational models of visual attention.

ACKNOWLEDGMENT

This work was supported by the Academy of Finland (Grant no. 259431) and the Infotech Oulu doctoral program.

REFERENCES

[1] C. H. Hansen and R. D. Hansen, "Finding the face in the crowd: An anger superiority effect," J. Pers. Soc. Psychol., vol. 54, pp. 917–924, 1988.

[2] F. Pratto and O. P. John, "Automatic vigilance: The attention-grabbing power of negative social information," J. Pers. Soc. Psychol., vol. 61, pp. 380–391, 1991.

[3] K. Humphrey, G. Underwood, and T. Lambert, "Salience of the lambs: A test of the saliency map hypothesis with pictures of emotive objects," J. Vis., vol. 12, no. 1, 2012.

[4] X. Lu, P. Suryanarayan, R. B. Adams, Jr., J. Li, M. G. Newman, and J. Z. Wang, "On shape and the computability of emotions," in ACM MM, 2012.

[5] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in ACM MM, 2010.

[6] V. Yanulevskaya, J. Uijlings, E. Bruni, A. Sartori, E. Zamboni, F. Bacci, D. Melcher, and N. Sebe, "In the eye of the beholder: employing statistical analysis and eye tracking for analyzing abstract paintings," in ACM MM, 2012.

[7] V. Yanulevskaya, J. van Gemert, K. Roth, A. Herbold, N. Sebe, and J. Geusebroek, "Emotional valence categorization using holistic image features," in ICIP, 2008.

[8] C. Colombo, A. Del Bimbo, and P. Pala, "Semantics in visual information retrieval," IEEE Multimedia, vol. 6, no. 3, pp. 38–53, 1999.

Fig. 7. The results for the computational bottom-up saliency models (Achanta, AWS, CAS, GBVS, Tavakoli) and the eye movement based saliency map (Human) in the valence classification task. (a-c) show the average precision (AP) values for each valence category (pleasant, unpleasant, neutral); (d) depicts the mean average precision (mAP) over all three categories.

Fig. 8. (a-c) show the average precision (AP) values for each valence category (pleasant, unpleasant, neutral) for the computational models of visual attention in conjunction with fixation duration (FD) and for the proposed method using the 2 best features. (d) depicts the mean average precision (mAP) over the three categories.

[9] M. Solli and R. Lenz, "Color based bags-of-emotions," in CAIP, 2009.

[10] W.-N. Wang and Y.-L. Yu, "Image emotional semantic query based on color semantic description," in ICMLC, 2005.

[11] S. Ramanathan, H. Katti, N. Sebe, M. Kankanhalli, and T.-S. Chua, "An eye fixation database for saliency detection in images," in ECCV, 2010.

[12] M. Dellagiacoma, P. Zontone, G. Boato, and L. Albertazzi, "Emotion based classification of natural images," in DETECT, 2011.

[13] M. Solli and R. Lenz, "Color emotions for image classification and retrieval," in CGIV, 2008.

[14] J. Itten, The Art of Color: The Subjective Experience and Objective Rationale of Color. New York: John Wiley, 1973.

[15] L.-C. Ou, M. R. Luo, A. Woodcock, and A. Wright, "A study of colour emotion and colour preference. Part I: Colour emotions for single colours," Color Research & Application, vol. 29, no. 3, pp. 232–240, 2004.

[16] M. Bar and M. Neta, "Humans prefer curved visual objects," Psychological Science, vol. 17, no. 8, pp. 645–648, 2006.

[17] J. Aronoff, "How we recognize angry and happy emotion in people, places, and things," Cross-Cultural Research, vol. 40, pp. 83–105, 2006.

[18] R. Arnheim, Art and Visual Perception: A Psychology of the Creative Eye, 1974.

[19] H. Zhang, E. Augilius, T. Honkela, J. Laaksonen, H. Gamper, and H. Alene, "Analyzing emotional semantics of abstract art using low-level image features," in IDA, 2011.

[20] Q. Wu, C. Zhou, and C. Wang, "Content-based affective image classification and retrieval using support vector machines," in ACII, 2005.

[21] C. Kanan and G. Cottrell, "Robust classification of objects, faces, and flowers using natural image statistics," in CVPR, 2010.

[22] H. J. Seo and P. Milanfar, "Action recognition from one example," IEEE Trans. Pattern Anal. Machine Intell., vol. 33, no. 5, pp. 867–882, 2011.

[23] A. Bulling, J. Ward, H. Gellersen, and G. Troster, "Eye movement analysis for activity recognition using electrooculography," IEEE Trans. Pattern Anal. Machine Intell., vol. 33, no. 4, pp. 741–753, 2011.

[24] R. Subramanian, V. Yanulevskaya, and N. Sebe, "Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements," in ACM MM, 2011.

[25] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 11, pp. 1254–1259, 1998.

[26] D. Walther and C. Koch, "Modeling attention to salient proto-objects," Neural Networks, vol. 19, pp. 1395–1407, 2006.

[27] M. R. Greene, T. Liu, and J. M. Wolfe, "Reconsidering Yarbus: A failure to predict observers' task from eye movement patterns," Vision Res., vol. 62, pp. 1–8, 2012.

[28] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti, "Analysis of scores, datasets, and models in visual saliency prediction," in ICCV, 2013.

[29] S. Maji, A. Berg, and J. Malik, "Classification using intersection kernel support vector machines is efficient," in CVPR, 2008.

[30] P. Lang, M. Bradley, and B. Cuthbert, "International affective picture system (IAPS): Affective ratings of pictures and instruction manual," University of Florida, Gainesville, FL, Tech. Rep. A-8, 2008.

[31] A. Garcia-Diaz, X. Fdez-Vidal, X. Pardo, and R. Dosil, "Saliency based on decorrelation and distinctiveness of local responses," in CAIP, 2009.

[32] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in CVPR, 2009.

[33] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in NIPS, 2007.

[34] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," in CVPR, 2010.

[35] H. Rezazadegan Tavakoli, E. Rahtu, and J. Heikkilä, "Fast and efficient saliency detection using sparse sampling and kernel density estimation," in Image Analysis, 2011.