
Group Visual Sentiment Analysis

Zeshan Hussain, Tariq Patanam and Hardie Cate

June 6, 2016

Abstract

In this paper, we introduce a framework for classifying images according to high-level sentiment. We subdivide the task into three primary problems: emotion classification on faces, human pose estimation, and 3D estimation and clustering of groups of people. We introduce novel algorithms for matching body parts to a common individual and clustering people in images based on physical location and orientation. Our results outperform several baseline approaches.

1 Introduction

A key problem in computer vision is sentiment analysis. Uses include identifying sentiment for social network enhancement and emotion understanding by robotic systems. Analyzing individual people and faces has many uses but has already been explored extensively. Group sentiment analysis, on the other hand, has been much less researched. Given an image of a group of people, we may wish to describe the overall sentiment of the scene. For example, if we have an image of students in a classroom, we may wish to determine the level of human interaction, the happiness of the students, their degree of focus, and other soft scene characteristics that together describe the overall sentiment. To tackle this problem, we propose a multi-label classification system that outputs labels for each of our scene characteristics. To perform this classification, we localize dense features from faces and poses of people as well as spatial relations of people in the image.

2 Related Work

2.1 Summary of Previous Work

Individual sentiment analysis is a long-studied problem, and there exists a host of methods for extracting emotion features. Notable among them is the Half-Octave Multi-Scale Gaussian derivative technique for quickly extracting facial features. Employing this method, Jain and Crowley were able to detect smiles with over 90% accuracy [8].

Discovering groups in images is itself a significant problem. Detecting people despite occlusions and viewpoint changes, and then grouping them according to their orientations and poses when multiple groups are present, is a highly complex problem that has only recently been addressed. Specifically, Choi et al. describe a model that learns an ensemble of discriminative interaction patterns to encode the relationships between people in 3D [4]. Their model learns a potential function that is minimized to efficiently map out groups of people in images as well as their activity and pose (e.g., standing, facing each other). Prior work has focused on the activity of a single person or pairs of people [9] [6] [11] [10], as well as pose estimation of groups, albeit operating under the assumption that there is only one group in the image [7].

There has also been some work done analyzing the sentiment of groups in images. Dhall et al., for example, measure happiness intensity levels in groups of people by utilizing a model that takes advantage of the global and local attributes of the image [5]. Borth et al. take a more general approach to sentiment analysis by training several concept classifiers on millions of images taken from Flickr. A prediction for an arbitrary image is an adjective-noun pair that defines the scene as well as its sentiment [1].

Finally, work has been done analyzing crowds of people by taking a more top-down approach. Using behavior priors, Rodriguez et al. track the motion of crowds. Other methods take a bottom-up approach, where collective motion patterns are learned and used to constrain the motion of individuals within crowds in a certain test scene [12].

2.2 Improvements on Previous Work

We contribute a novel multi-label group sentiment classification system that is built using several components from previous papers, including emotion and pose classification as well as group structure and orientation prediction. Our method of combining these features to classify group sentiment is a new approach to sentiment analysis. Additionally, we propose a simple, novel clustering approach to predict group structure and orientation. Specifically, we use a heuristic for 3D estimation of the people in the image and then use a variant of k-means to cluster them.

3 Technical Approach

3.1 Summary

For each input image, I, we perform the following feature extractions:

1. Emotion Extraction

2. Poselet Estimation

3. Group Structure and Orientation Estimation

Suppose that each feature extraction results in the feature vectors f1, f2, f3, respectively. Our extraction is done in such a way that, for a fixed i = 1, 2, 3, the number of entries in fi is the same for all images. This allows us to simply concatenate features from each of the three approaches for use in an SVM classification.
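To make the concatenation step concrete, here is a minimal sketch in Python, assuming hypothetical stub extractors that each return a fixed-length NumPy vector per image (the stub dimensions and the toy data below are placeholders, not values from the paper):

```python
import numpy as np
from sklearn.svm import SVC

# Stub extractors standing in for the three pipelines described above.
# Each must return a vector whose length does not depend on the image.
def extract_emotion_features(image):
    return np.zeros(32)       # placeholder dimension

def extract_poselet_features(image):
    return np.zeros(150)      # placeholder dimension

def extract_group_features(image):
    return np.zeros(8)        # placeholder dimension

def build_feature_vector(image):
    return np.concatenate([extract_emotion_features(image),
                           extract_poselet_features(image),
                           extract_group_features(image)])

# images: list of arrays; labels: annotated 1-4 intensity for one characteristic
images = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
labels = [1, 2, 3, 4]
X = np.stack([build_feature_vector(img) for img in images])
clf = SVC(kernel="linear").fit(X, labels)
```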

3.2 Technical Background andDatasets

Our primary dataset consists of 597 images, each corresponding to one of six scenes: "bus stop," "cafeteria," "classroom," "conference," "library," and "park." It is the same dataset used by [4]. We define four scene characteristics, namely level of human interaction, physical activity, happiness, and focus, that describe each image. In order to provide a ground truth labeling for image characteristics, we go through all of our images and manually annotate them on a scale of 1 to 4, with 1 as the least and 4 as the most. For instance, a classroom scene with students working at their desks would most likely correspond to a focus score of 4, whereas an image of people staring off into space while waiting for a bus would likely have a score of 1.

3.3 Technical Solution

To get a better sense for the problem, we begin with several (naive) baseline approaches that we improve upon with more advanced techniques. First, we create a color histogram for each image that we use as input features for an SVM. Each pixel consists of values for 3 color channels, and we divide the possible values for each channel into 8 bins, so each pixel votes for one of 8 × 8 × 8 = 512 bins according to the values in its color channels. We then use this histogram of 512 values as input for an SVM. This model ends up overfitting on our small dataset but performs slightly better than random on a validation set.
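A minimal sketch of this histogram baseline, assuming 8-bit, 3-channel images; the normalization at the end is our addition and is not specified in the text:

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """8 x 8 x 8 joint color histogram: each pixel votes for one of 512 bins."""
    pixels = image.reshape(-1, 3).astype(int)
    idx = pixels // (256 // bins_per_channel)          # channel value -> bin 0-7
    flat = idx[:, 0] * 64 + idx[:, 1] * 8 + idx[:, 2]  # joint bin index 0-511
    hist = np.bincount(flat, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()  # normalization (our addition, keeps images comparable)

# The 512-dim histograms are then used as SVM input, as with the other baselines.
```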

As a second baseline, we incorporate features from bounding boxes that define the locations of people in the images. These bounding boxes come with the dataset used in the paper by Choi et al. discussed above. We use the coordinates of the bounding boxes directly as features for input to an SVM. We limit the number of bounding boxes per image to 15 (since the number of bounding boxes in most images was below this threshold) by randomly selecting boxes if there are more present in the image and padding our flattened vector of coordinates with zeros if there are fewer boxes in the image. Similar to the above approach, this technique only slightly outperforms a random selection process, with more detailed results discussed below.
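A sketch of how this fixed-length bounding-box feature could be formed, assuming each box is given as (x1, y1, x2, y2):

```python
import random
import numpy as np

MAX_BOXES = 15  # threshold from the text

def bbox_features(boxes, max_boxes=MAX_BOXES):
    """Flatten up to max_boxes person boxes into a 4 * max_boxes vector.

    Randomly subsample if there are too many boxes; zero-pad if there are too few.
    """
    boxes = list(boxes)
    if len(boxes) > max_boxes:
        boxes = random.sample(boxes, max_boxes)
    flat = [float(coord) for box in boxes for coord in box]
    flat += [0.0] * (4 * max_boxes - len(flat))
    return np.array(flat)
```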

3.4 Emotion Classification

Our emotion classification pipeline involves several stages.

3.4.1 Face Detection

First, we detect faces by leveraging the Poselet detector [2] and the Haar Cascade classifier built into OpenCV [3]. For a given image, we get all face-related poselets (effectively bounding boxes within the image) from the Poselet detector and then run the Haar Cascade classifier to detect faces within these bounding boxes. We remove duplicate faces by checking for overlaps among the face bounding boxes.

Unfortunately, the dataset we use contains many people who are far away from the camera, so their faces are often small and therefore difficult to detect with traditional face detection methods. To address this issue, we take advantage of the fact that we have bounding boxes for people from [4] and that we can identify torsos of people using poselets. This additional information makes it possible to adjust the settings of the Haar classifier to be lenient and then prune erroneous faces using a novel matching algorithm that we describe below.

Suppose we have an image I. Suppose further that we have sets of bounding boxes in the image for people, faces, and torsos, which we denote P, F, and T, respectively. First, we loop over F and assign to each face f ∈ F a person p ∈ P such that f is contained in p and p is, of the remaining person boxes, the one that minimizes the distance between the centers of the top edges of the two bounding boxes. This distance is a useful heuristic because faces should be centered and near the top within the bounding box to which they belong.

Next, we loop over P from smallest to largest and assign to each person p ∈ P a torso t ∈ T such that t is contained in p and t is the largest such torso among the remaining unassigned torsos. We proceed through P from smallest to largest because smaller bounding boxes are less likely to contain a large number of torsos, so these bounding boxes provide tighter constraints. We choose the largest torso contained in p because we find that there are often many small erroneous torsos, while larger torsos (that also fit within a bounding box) are more likely to correspond to actual human torsos.
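The following sketch gives one possible reading of this matching heuristic; the box representation (x1, y1, x2, y2) and the helper names are our own:

```python
import numpy as np

def contains(outer, inner):
    """True if box inner lies entirely inside box outer; boxes are (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def top_center(box):
    return np.array([(box[0] + box[2]) / 2.0, box[1]])

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def match_parts(people, faces, torsos):
    """Assign at most one face and one torso to each person bounding box."""
    face_of, torso_of = {}, {}
    free_people = set(range(len(people)))
    # Step 1: each face picks the remaining containing person whose top-edge
    # center is closest to the face's top-edge center.
    for fi, f in enumerate(faces):
        candidates = [pi for pi in free_people if contains(people[pi], f)]
        if candidates:
            best = min(candidates, key=lambda pi: np.linalg.norm(
                top_center(people[pi]) - top_center(f)))
            face_of[best] = fi
            free_people.remove(best)
    # Step 2: walk people from smallest to largest box; each takes the largest
    # unassigned torso contained in it.
    free_torsos = set(range(len(torsos)))
    for pi in sorted(range(len(people)), key=lambda pi: area(people[pi])):
        candidates = [ti for ti in free_torsos if contains(people[pi], torsos[ti])]
        if candidates:
            best = max(candidates, key=lambda ti: area(torsos[ti]))
            torso_of[pi] = best
            free_torsos.remove(best)
    return face_of, torso_of
```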

Figure 1: Face extraction when done only with a Haar Cascade face detector (top) vs. first looking for head poselets and then detecting faces within them.

At the end of the algorithm, each bounding box corresponding to a person is assigned at most one face and one torso. Any unassigned bounding box for a person, face, or torso is discarded. This allows us to concentrate on the most important people in the image (i.e., people nearest to and likely facing the camera). Torsos are potentially useful later on for pose estimation using silhouettes. Faces are useful both for emotion classification and for 3D estimation.


3.4.2 Emotion Classification Method

Once the faces are detected, we classify them with emotion labels. In order to do so, we generate features for each face using a multi-scale Gaussian pyramid modeled after the approach found in [8]. Given a face image as input, we convert the image to grayscale and iteratively perform Gaussian convolutions on the image at varying scales. At each iteration, we compute first- and second-order image gradients at each pixel, then aggregate the means and standard deviations of these gradients in 4x4 windows across the entire image. Each 4x4 window at a given scale produces 10 features, 5 from the means of the five different gradient types and 5 from the standard deviations. At each scale, we aggregate features three times using the half-octave method, and we build our pyramid to a depth of three scales. At scale i, we apply the following convolution pipeline to an input image.

Given: I_0(x, y, i)

I_1(x, y, i) = I_0(x, y, i) * g(x, y; σ)

I_2(x, y, i) = I_1(x, y, i) * g(x, y; √2 σ)

I_0(x, y, i+1) = I_2(x, y, i)

We use the aggregate of I_0, I_1, and I_2 at each stage as our features.
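A sketch of this pyramid in Python with OpenCV, under two assumptions not stated above: faces are first resized to a fixed 64x64 patch so every face yields the same feature length, and the Gaussian derivatives are approximated by Sobel gradients of the blurred images:

```python
import numpy as np
import cv2

def window_stats(img, win=4):
    """Mean and std of five gradient responses over non-overlapping win x win windows."""
    grads = [cv2.Sobel(img, cv2.CV_32F, dx, dy)
             for dx, dy in [(1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]]
    feats = []
    for g in grads:
        h, w = g.shape[0] // win * win, g.shape[1] // win * win
        blocks = g[:h, :w].reshape(h // win, win, w // win, win)
        feats.append(blocks.mean(axis=(1, 3)).ravel())  # 5 means per window ...
        feats.append(blocks.std(axis=(1, 3)).ravel())   # ... and 5 standard deviations
    return np.concatenate(feats)

def half_octave_features(face_bgr, depth=3, sigma=1.0, size=64):
    img = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (size, size)).astype(np.float32)
    feats = []
    for _ in range(depth):
        i1 = cv2.GaussianBlur(img, (0, 0), sigma)              # I1 = I0 * g(sigma)
        i2 = cv2.GaussianBlur(i1, (0, 0), np.sqrt(2) * sigma)  # I2 = I1 * g(sqrt(2) sigma)
        feats.extend(window_stats(level) for level in (img, i1, i2))
        img = i2                                               # I0 at the next scale
    return np.concatenate(feats)
```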

Though this technique was originally used strictly for detecting smiles, we extend it to classify other emotions, since the same features should be relevant. Using the feature vectors produced from the multi-scale Gaussian pyramid, we train an SVM on these features to classify various emotions. We train on a dataset that contains only faces and is annotated with emotions including happy, sad, surprised, fear, etc. A more complete discussion can be found in the Experiments section, but ultimately we use a binary happy/neutral classification when training our SVM.

Once we have a prediction for each face, we divide the image into 16 evenly spaced windows. Each window acts as a 2-class histogram, producing a count of the number of smiles and neutral faces for each portion of the image. We then use this 32-element vector as the feature vector for our emotion pipeline. This binning process standardizes the length of the feature vector for each image so that we can run an SVM on these features for each of our final sentiment classes.
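A sketch of this spatial binning step, assuming the 16 windows form a 4x4 grid and that each face is represented by the center of its bounding box (both assumptions on our part):

```python
import numpy as np

def emotion_histogram(face_centers, smile_flags, image_shape, grid=4):
    """2-class histogram over a grid x grid partition: 32 values for grid = 4."""
    h, w = image_shape[:2]
    hist = np.zeros((grid * grid, 2))
    for (x, y), smiling in zip(face_centers, smile_flags):
        col = min(int(x / w * grid), grid - 1)
        row = min(int(y / h * grid), grid - 1)
        hist[row * grid + col, 1 if smiling else 0] += 1
    return hist.ravel()
```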

3.5 Pose Estimation

As mentioned earlier, the Poselet detector provides a wealth of information relating to people in images. There are 150 types of poselet descriptors, and each one acts as a detector for different body parts (e.g., left leg, right arm, head, shoulders, etc.).

Given an input image, we get all poselets as bounding boxes along with an ID and score for each poselet. The ID identifies the type of poselet, and the score indicates the accuracy with which the image within the bounding box matches that particular poselet type. We filter out all poselets below a certain score threshold; empirically, we find that a score of 0.9 provides a reasonable threshold. After filtering out these erroneous poselets, we produce a 150-class feature vector for the image by binning poselets according to their ID. This gives us a count of the number of occurrences of each poselet type.
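This binning step is straightforward; a small sketch, assuming poselet IDs are integers in [0, 149]:

```python
import numpy as np

NUM_POSELET_TYPES = 150
SCORE_THRESHOLD = 0.9   # empirical threshold from the text

def poselet_histogram(poselet_ids, poselet_scores,
                      threshold=SCORE_THRESHOLD, num_types=NUM_POSELET_TYPES):
    """Count confident detections of each poselet type."""
    hist = np.zeros(num_types)
    for pid, score in zip(poselet_ids, poselet_scores):
        if score >= threshold:
            hist[pid] += 1
    return hist
```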

As discussed in the Experiments section, these feature vectors provide a reasonable estimation of the different human poses throughout the image. Poselets not only identify body parts, but they also give indications of the shapes of these parts (e.g., elbow bent vs. elbow straight). These feature vectors from poselets ultimately prove to have significant predictive power.

3.6 Group Structure and OrientationEstimation

We now describe our approach for predicting orientation and group structure in images. First, we compare two methods for predicting orientation. We then propose a 3D estimation algorithm to estimate the 3D coordinates of people. Then, using both orientations and 3D coordinates, we run a variant of the k-means algorithm to cluster people into groups. Below, we detail our two methods for orientation prediction.

In the first method, we use GrabCut to get the silhouettes of the people in each image. To estimate the relative foreground and background of each person, we utilize the matched torso and face per bounding box. Specifically, we mark a stripe of pixels in the face of the person and a rectangular area of pixels in the torso of the person as the foreground, and all the pixels outside of the bounding box as the background. We then extract SIFT features on the resultant silhouettes and train an SVM using these features to predict orientation.

Our second method is slightly simpler. Instead of extracting SIFT features from silhouettes, we extract HOG features from the bounding box. The benefit of this second method is that we obtain a denser feature representation of the image without losing its "global" nature. Because HOG looks at the image more globally, it does better with occlusions. For our orientation prediction, we move forward with this second method due to its simplicity, representational efficiency, and time constraints.

To estimate the depth of each person, we utilize the following relationship between depth and face size: d = k/f, where d is depth, f is the size of the face in the image, and k is a constant. Intuitively, as the face size decreases, the depth of the face increases.

Finally, with each person mapped to a 3D coordinate, we can run a variant of k-means constrained by the orientations of people to find clusters of groups in the image. Our variation of k-means uses an orientation coefficient which linearly weights the traditional Euclidean distance between a point and its corresponding cluster centroid. For each person with unit orientation vector θ, we compute the vector θ − φ, where φ is the unit vector along the direction between the person and the cluster center. The orientation coefficient c is computed as a linear function of θ − φ, with c = 0.5 when θ = φ and c = 1.5 when θ = −φ. This orientation heuristic favors clusters in which more people are facing the cluster center, mimicking situations in which people are facing each other during a conversation.
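A sketch of the depth heuristic and the orientation-weighted distance. The text only pins down the endpoints c = 0.5 (facing the center) and c = 1.5 (facing away); taking c = 1 − 0.5 (θ · φ) is one consistent reading, and treating orientation in the ground plane is our assumption:

```python
import numpy as np

def estimate_depth(face_size, k=1.0):
    """d = k / f: depth grows as the apparent face size shrinks (k is scene-dependent)."""
    return k / max(face_size, 1e-6)

def orientation_coefficient(theta, person_xy, center_xy):
    """Weight in [0.5, 1.5]; smallest when the person faces the cluster center."""
    phi = np.asarray(center_xy, float) - np.asarray(person_xy, float)
    phi /= (np.linalg.norm(phi) + 1e-9)
    return 1.0 - 0.5 * float(np.dot(theta, phi))   # one reading of "linear in theta - phi"

def modified_distance(point, theta, center):
    """Euclidean distance to the centroid, scaled by the orientation coefficient."""
    point, center = np.asarray(point, float), np.asarray(center, float)
    c = orientation_coefficient(theta, point[:2], center[:2])
    return c * np.linalg.norm(point - center)
```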

With this modified distance function in place, we run k-means with varying values of k. We then choose the value of k that minimizes the sum of modified distances between points and their cluster centers while placing a restriction on the number of singleton clusters (clusters with only one person). We place this restriction on singleton clusters because we want to avoid the situation in which all centers are coincident with people, since we do not know the value of k beforehand. Note that we do not use mean shift because we also do not know the appropriate window size beforehand, and mean shift generally does not perform well on small, sparse datasets. We define below the potential function that we maximize to find the optimal k.

Φ = Σ_{i=1}^{n} o_i · c_i − k Σ_{i=1}^{n} d_i

where

k = constant weighting factor

o_i = unit orientation vector of person i

c_i = unit direction vector from person i to its cluster center

d_i = distance from person i to its cluster center

This allows us to avoid over-individualizing the groupings while only including people within a reasonable distance.
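A sketch of this model-selection loop; for brevity it runs scikit-learn's standard k-means rather than the orientation-weighted variant, and the candidate range of k, the singleton limit, and the weighting constant are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def potential(points, thetas, labels, centers, weight=1.0):
    """Phi = sum_i o_i . c_i - weight * sum_i d_i (higher is better).

    thetas: unit orientation vectors in the same coordinates as points.
    """
    total = 0.0
    for p, theta, lab in zip(points, thetas, labels):
        diff = centers[lab] - p
        d = np.linalg.norm(diff)
        total += float(np.dot(theta, diff / (d + 1e-9))) - weight * d
    return total

def choose_k(points, thetas, k_values=range(1, 6), max_singletons=1, weight=1.0):
    """Pick the k whose clustering maximizes the potential, limiting singleton clusters."""
    points = np.asarray(points, float)
    best_k, best_phi = None, -np.inf
    for k in k_values:
        if k > len(points):
            break
        km = KMeans(n_clusters=k, n_init=10).fit(points)
        sizes = np.bincount(km.labels_, minlength=k)
        if np.sum(sizes == 1) > max_singletons:
            continue
        phi = potential(points, thetas, km.labels_, km.cluster_centers_, weight)
        if phi > best_phi:
            best_k, best_phi = k, phi
    return best_k
```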

It is important to note that, due to time constraints, we were unable to fully implement the 3D estimation and orientation aspects of our pipeline. We did, however, implement most of the pipeline and were able to produce a very accurate classifier of person orientations. We believe the approach outlined above would identify groups of people well enough to provide additional predictive power on the overall problem of sentiment classification.

4 Experiments

As mentioned, analyzing group sentiment requires several individual components to come together. To bring these components together and evaluate them, we conducted a series of experiments.

4.1 Baselines

We start by establishing two baselines. Our first baseline is built by extracting a color histogram for each image and then training an SVM on those feature vectors. We achieve 28.3% accuracy, about 3% better than random. Our second baseline is built by constructing a feature vector from the bounding boxes; the idea is to form a crude estimate of the groupings. It performs relatively poorly, achieving an average accuracy of 30%, only 5% better than randomly binning into one of four intensities for a sentiment class.

4.2 Smile Detection

A key component of group sentiment is determining individual sentiment. We begin by testing our implementation of the Half-Octave Multi-Scale Gaussian Pyramid algorithm [8] on the GENKI-4K dataset [13]. We achieve an accuracy of 83.5% when classifying in binary fashion between smile and neutral, on par with current methods for smile detection.

Metric          Crowley   Us
Training error  N/A       0.0
Testing error   0.082     0.165

Table 1: Error comparisons with the Crowley paper

Admittedly, our accuracy is slightly below that of Jain & Crowley, as we do not fine-tune the parameters of the soft-margin SVM as meticulously. Yet, as we will see, this individualized emotion extraction algorithm plays a key role in improving overall group sentiment analysis prediction.

4.3 Sad and Happy Detection withthe Kaggle Dataset

Using the same Gaussian pyramid algorithm, we further test our ability to extract features and detect emotions on the Kaggle Facial Expression Recognition Challenge dataset. The detection system works less robustly on this set, achieving 71.5% accuracy when classifying between happiness and neutrality and 56% when classifying between sadness and neutrality.

There are two important observations to note here. First, the images in the GENKI set are more refined, almost always featuring only faces, which are larger than those in the Kaggle set. The Kaggle set, meanwhile, often has pictures with hands covering parts of the face. Occlusions, then, are particularly challenging for the Multi-Scale Gaussian method. Second, the Gaussian pyramid algorithm seems particularly well adjusted to the intricacies of a smile. While not performing as well on "happy" versus "neutral" in the Kaggle set as it did on the GENKI set, the algorithm did better within the Kaggle set on the binary classification of "happy" than on the binary classification of "sad." This could be because the sadness expressed in the Kaggle dataset is usually visually less expressive than the happiness, and therefore less distinct from the neutral state.

4.4 Orientation Classification

The next stage in overall group sentiment analysis is incorporating the groupings between individual people. In order to do that, we build a relatively robust SVM model using HOG feature extraction trained on the TUD Multiview Pedestrian dataset, achieving 90.3% accuracy when classifying orientations into one of eight cardinal directions. Such a high accuracy rate is likely due to the fact that gradient orientations can easily define the overall orientation of people, and gradient orientations are the basis of HOG features.
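A sketch of the orientation classifier, assuming OpenCV's default HOG descriptor parameters (64x128 window, 9 orientation bins) and grayscale person crops; the actual HOG configuration used is not specified in the text:

```python
import cv2
import numpy as np
from sklearn.svm import SVC

hog = cv2.HOGDescriptor()   # default 64x128 window, 9 orientation bins

def hog_features(person_crop_bgr):
    """Describe one person bounding-box crop with a fixed-length HOG vector."""
    gray = cv2.cvtColor(person_crop_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (64, 128))
    return hog.compute(resized).ravel()

# Training: X is the stack of HOG vectors for annotated crops and y holds one of
# the eight cardinal directions (0-7); prediction works on new crops the same way.
# clf = SVC(kernel="linear").fit(X, y)
# direction = clf.predict([hog_features(test_crop)])
```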

Figure 6: A comparison of a happy GENKI image, a neutral Kaggle image, and a sad Kaggle image. Note there is little visual difference between the neutral and sad Kaggle images.

4.5 Group Sentiment Analysis

We can then combine our distinct features (detailed in the Technical Approach) into an overall scene feature vector. Their performance is detailed below.

4.5.1 Individual Emotion Extraction

When using the overall facial emotion feature vector, we achieve strong results for happiness and activity. This is expected, as our Multi-Scale Gaussian algorithm proves strongest at detecting smiles, which correlate heavily with the overall happiness of a scene as well as, in the case of the Group Discovery dataset, with activity. In contrast, overall group "focus" is probably more correlated with posture, and "interaction" more correlated with the number and size of the groupings.

4.5.2 Poselets

When using only the most identifying poses visible in the scene, weighting them by their score from the poselet detector, we achieve better results only for "activity." While we expect better results for focus as well, this could very well be an indication that "focus" is not easily measurable from one frame of a scene.

4.5.3 Individual Emotion Extraction andPoselets

Our final experiment is our true goal: testing our cohesive method for classifying overall scene sentiment. For each sentiment we bin into four intensities and achieve the error rates reported in Table 2.

4.5.4 A Binary Classification of Intensities

We achieve 45% accuracy in classifying the happiness of scenes into the correct intensity bucket. However, classification of the other sentiment values is less accurate. To a great degree, this could be because of the fickleness of human labeling. It is often hard, even for a human, to distinguish between a focus intensity of 3 and a focus intensity of 4. As such, we ran one final experiment, labeling all intensity values of 3 or 4 as 1 and all intensity values of 1 or 2 as 0. The results for this modified binary classifier are reported in Table 3.

Taking happiness as an example, as it is our most accurate result, we achieve a 53.4% improvement over random selection with the binary classifier, whereas with the four-way intensity classifier we achieve an 80.4% improvement over random classification. Thus there is not a strong improvement from using a binary classifier over the four-degree intensity classifier, which suggests that the weakness lies not in the granularity of the labeling but in the previously mentioned feature extraction techniques themselves.

4.6 Final Notes

Unfortunately, we did not have enough time to use our robust orientation detection model to test group discovery. Our thinking on this, however, is detailed in the Technical Approach. In order to achieve even better results in the future, we would likely need to incorporate more complex scene information such as objects in the background and scene type.

5 Conclusion

In closing, we summarize a couple of important results. For our feature extraction results, we have high performance in detecting emotion (namely, smiles) as well as in predicting orientation. Using these features, we perform reasonably well on happiness and activity. We see a strong correlation between our extracted emotions and the happiness level, as well as between poselets and the activity level in the image, which is an intuitive result. In further work, we hope to complete our group estimation pipeline and combine these features into our final feature vectors. Adding the group features will most likely increase the accuracy of our system in predicting interaction and focus.

In this paper, we contribute a multi-label classification system that performs significantly above random chance with our feature extraction methods. Our approach effectively utilizes previous feature extraction methods as well as novel methods.

Finally, here is a link to our GitHub repo: https://github.com/zeshanmh/VisualSentiment


Combined Feature Extraction Results

Metric          Happiness   Activity   Interaction   Focus
Training Error  0.306       0.285      0.310         0.266
Testing Error   0.558       0.617      0.642         0.667

Table 2: Errors for the SVM trained by extracting emotions and poselets

Combined Feature Extraction Results under a Binary Classifier

Metric          Happiness   Activity   Interaction   Focus
Training Error  0.189       0.163      0.237         0.277
Testing Error   0.233       0.242      0.4           0.4167

Table 3: Errors for the SVM trained by extracting emotions and poselets but using a binary classifier

6 References

[1] Damian Borth et al. "Large-scale visual sentiment ontology and detectors using adjective noun pairs". In: Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 223–232.

[2] Lubomir Bourdev and Jitendra Malik. "Poselets: Body part detectors trained using 3D human pose annotations". In: Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 1365–1372.

[3] G. Bradski. "OpenCV". In: Dr. Dobb's Journal of Software Tools (2000).

[4] Wongun Choi et al. "Discovering groups of people in images". In: Computer Vision – ECCV 2014. Springer, 2014, pp. 417–433.

[5] Abhinav Dhall, Roland Goecke, and Tom Gedeon. "Automatic group happiness intensity analysis". In: Affective Computing, IEEE Transactions on 6.1 (2015), pp. 13–26.

[6] Piotr Dollar et al. "Behavior recognition via sparse spatio-temporal features". In: Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International Workshop on. IEEE, 2005, pp. 65–72.

[7] Marcin Eichner and Vittorio Ferrari. "We are family: Joint pose estimation of multiple persons". In: Computer Vision – ECCV 2010. Springer, 2010, pp. 228–242.

[8] Varun Jain and James Crowley. "Smile detection using multi-scale Gaussian derivatives". In: 12th WSEAS International Conference on Signal Processing, Robotics and Automation. 2013.

[9] Ivan Laptev. "On space-time interest points". In: International Journal of Computer Vision 64.2-3 (2005), pp. 107–123.

[10] Jingen Liu, Jiebo Luo, and Mubarak Shah. "Recognizing realistic actions from videos 'in the wild'". In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1996–2003.

[11] Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. "Unsupervised learning of human action categories using spatial-temporal words". In: International Journal of Computer Vision 79.3 (2008), pp. 299–318.

[12] Mikel Rodriguez et al. "Data-driven crowd analysis in videos". In: Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1235–1242.

[13] http://mplab.ucsd.edu. The MPLab GENKI Database, GENKI-4K Subset.


Figure 2: Final confusion matrix of interaction prediction

Figure 3: Final confusion matrix of focus prediction

Figure 4: Final confusion matrix of happiness prediction

Figure 5: Final confusion matrix of activity prediction
