
Recognizing stress using semantics and modulation of speech and gestures

Iulia Lefter∗†‡, Gertjan J. Burghouts†, and Léon J.M. Rothkrantz∗‡
∗Delft University of Technology, Delft, The Netherlands

†TNO, The Hague, The Netherlands
‡The Netherlands Defence Academy, Den Helder, The Netherlands

Abstract—This paper investigates how speech and gestures convey stress, and how they can be used for automatic stress recognition. As a first step, we look into how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the intended semantic message (e.g. spoken words for speech, symbolic meaning for gestures), and stress conveyed by the modulation of either speech or gestures (e.g. intonation for speech, speed and rhythm for gestures). As a second step, we use this decomposition of stress as an approach for automatic stress prediction. The considered components provide an intermediate representation with intrinsic meaning, which helps bridge the semantic gap between the low level sensor representation and the high level context sensitive interpretation of behavior. Our experiments are run on an audiovisual dataset with service-desk interactions. The final goal is a surveillance system that notifies when the stress level is high and extra assistance is needed. We find that speech modulation is the best performing intermediate level variable for automatic stress prediction. Using gestures increases the performance and is mostly beneficial when speech is lacking. The two-stage approach with intermediate variables performs better than baseline feature level or decision level fusion.

Index Terms—Stress, surveillance, speech, gestures, multimodal fusion, semantics, modulation.

I. INTRODUCTION

WHILE automatic detection of unwanted behavior is desirable and many researchers have delved into it, there are still many unsolved problems that prevent intelligent surveillance systems from being installed to assist human operators. One of the main challenges is the complexity of human behavior and the large variability of manifestations that should be taken into consideration.

Emotions and stress play an important role in the development of unwanted behavior. In [43], a distinction is made between instrumental and affective aggression. Instrumental aggression is goal directed and planned, e.g. pick-pocketing. Affective aggression results from strong emotional feelings, like anger, fear and frustration. Furthermore, there is a link between stress and aggression [3], [41]. Detecting negative emotions and stress at an early stage can help prevent aggression. However, an early stage is characterized by more subtle behavior compared to violence, which increases the difficulty of automatic detection.

Stress is a phenomenon that causes many changes in the human body and in the way in which people interact. It is a psychological state formed as a response to a perceived threat, task demand or other stressors, and is accompanied by specific emotions like frustration, fear, anger and anxiety. For a thorough overview of stress we refer the reader to [28]. Since the final application of this work is in the surveillance domain, we are interested in the stress observed in the overall scene, and not per person.

∗ Corresponding author: I. Lefter (email: [email protected]).

We consider the case of supervising human-human interactions at service desks based on audio-visual data. However, the service desk domain serves as a proof of concept. Human operated service desks are places where cases of urgency, overload and communication problems are likely to occur. The situation is similar for virtual agent systems employed in public service, where stress can arise due to task complexity and the inability of the virtual agent to act as expected by the client. Our goal is to automatically detect when stress is increasing and extra assistance is needed.

People use a variety of communicative acts to express semantic messages and emotion. Speech is used to communicate via the meaning of words, as well as via the manner of speaking. Several other nonverbal cues like facial expressions, gestures, postures and other body language are used in communication. We are interested in how these verbal and nonverbal cues are used in conveying stress, and how they can be used to automatically assess stress. In particular, our attention focuses on speech and hand gestures (hereinafter called gestures), since they are rich sources of communication and promising for automatic assessment.

While speech is generally considered the primary means of communication, [34] and [21] emphasize the importance of gestures. However, contradictory findings are presented in [25], where it is suggested that gestures have no additional communicative function compared to speech. Motivated by these contradictions on the communicative function of gestures in general, we study the role of gestures in assessing stress.

For automatic behavior analysis, typically low level features are mapped to a ground truth. A known problem with this approach is the semantic gap between the high level context sensitive interpretation of behavior and the low level machine generated features. We expect that this problem is likely to occur in the case of automatic stress prediction, since stress is a complex concept and has a large variety of manifestations. A possible solution for bridging the semantic gap is to consider a decomposition of stress into variables that might be easier to predict, and to build a final decision based on them.


To summarize, we formulate our research questions as follows:

1) What is the contribution of verbal and nonverbal communicative acts in conveying stress?

2) Do gestures contribute to communicating stress and automatic stress assessment, or is all the information already included in speech?

3) Which intermediate features can be used in the framework of automatically assessing stress, what are their unique predictive values and which are good combinations that complement each other?

4) What is the impact of using intermediate level features compared to low level sensor features only?

To research these questions we use a dataset of audio-visual recordings of human-human interactions at service desks. The recordings contain a variety of scenarios in which stressful situations arise. The scenarios were freely improvised by actors, which means that the interaction built up naturally, with the actors spontaneously reacting to one another.

Our first step is to study how humans perceive stress from the recordings, based on cues from speech and gestures. We propose a human model for conveying and perceiving stress by speech and gestures, and based on annotations, we analyze which communicative acts are dominant in conveying stress. As a second step, we propose a model for automatic stress assessment. It is based on a three level architecture, and the intermediate level is inspired by the variables proposed in the human model.

This paper is organized as follows. In section II we give an overview of relevant related work. Next, in section III, we present our models for conveying and automatically assessing stress, which we use for answering the research questions. We continue in section IV with descriptions of the dataset and its annotations. Section V focuses on the intermediate level variables proposed for the automatic stress assessment framework. Next, we give details on the experimental setup in section VI, including a description of the low level features used for automatic stress assessment, the classification procedure and the segmentations. Details on how stress is conveyed by speech and gestures based on the human model are presented in section VII. We present and compare our automatic stress prediction method to a baseline stress prediction from low level features in section VIII. The paper ends with a summary and conclusions in section IX.

II. RELATED WORK

This section highlights relevant related studies, ranging from identifying stress cues in speech and gestures to automatic behavior recognition based on speech, gestures and multimodal data. We end by outlining works that use intermediate representations and explain what makes our approach different.

Many studies investigate cues in verbal and nonverbal communication, and how they are used for communication and affective displays. A comprehensive set of acoustic cues, their perceived correlates, definitions and acoustic measurements in vocal affect expression is provided in [18]. The same work also provides guidelines for choosing a minimum set of features that is bound to provide emotion discrimination capabilities. An extensive study [38] presents an overview of empirically identified major effects of emotion on vocal expressions. In [36] more emphasis has been put on voice and stress. The most important acoustic and linguistic features characteristic of emotional states in a corpus of children interacting with a pet robot are identified in [2]. In [10] and [11], different categories of nonverbal behavior are identified, part of them having the function of communicating semantic messages and part of them transmitting affect information. These investigations point to the suitability of considering speech and gestures for assessing stress.

The research described in [17] focuses on analyzing, modeling and recognizing stress from voice, using recordings of simulated and actual stress [16]. The study in [33] considers discriminating normal from stressed calls received at a call-center. Multiple prediction methods applied to different feature types are fused, resulting in a significant improvement over each single prediction method. The work of [32] focuses on automatic aggression detection based on speech.

The relationship between emotion and hand gestures (more specifically handedness, hand shape, palm orientation and motion direction) has been investigated in [23]. The study concluded that there is a strong relation between the choice of hand and emotion. An interesting approach for recognizing emotion from gestures is presented in [9]. Instead of trying to recognize different gesture shapes that express emotion, the authors use nonpropositional movement qualities (e.g. amplitude, speed and fluidity of movement). Most relevant for our work is the approach in [9], which proves that emotion recognition from gestures can be performed without specifically recognizing the gestures, but rather by focusing on how they are performed.

The added value of using multimodal information instead of unimodal information has been highlighted in previous studies. Research in [8] addresses multimodal emotion recognition using speech, facial expressions and gestures. Their algorithms were trained on acted data for which actors were supposed to show a specific gesture for each emotion. Fusing the three types of features at both feature and decision level improved over the best performing unimodal recognizer. Fusion of cues in the face and in gestures has been used for emotion recognition in [14]. Again, a gesture type was considered for each emotion. For surveys on multimodal emotion recognition we refer the reader to [15] and [44]. Future directions outlined in [44] include developing methods for spontaneous affective behavior analysis, which are addressed in this paper.

Several studies reflect the benefits of using intermediate representations in order to automatically assess a final concept. In [12], the focus is on detecting violent scenes in videos for the protection of sensitive social groups by audio-visual data analysis. A set of special classes is detected, such as music, speech, shots, fights and screams, together with the amount of activity in the video. From video, the amount of activity in the scene was used to discriminate between inactivity, activities without erratic motion and activities with erratic motion (fighting, falling). A two stage approach for detecting fights based on acoustic and optical sensor data in urban environments is presented in [1]. Low level features are used to recognize a set of intermediate events like crowd activity, low sounds and high sounds. The events based on video data are related to the behavior of crowds: normal activity, intensive activity by a few or by many persons, and small crowd or large crowd; in addition, different categories of sounds are distinguished. The highest performance is achieved by fusing the cues from both modalities, again indicating that multimodal fusion is an interesting direction to explore for stress prediction. In the work of Grimm et al. [13], the continuous representation of emotions composed of valence, activation and dominance was used as an intermediate step for classifying emotion categories.

Interesting studies can be found in the literature with respect to both automatic emotion assessment and automatic surveillance. We see our work as lying at the intersection of these two fields. What makes our research different is that we take inspiration from the human model of stress perception, adding intermediate level variables based on what speech and gestures communicate and how they communicate it. The considered case study is challenging due to the nature of the data. Unlike in the mentioned emotion recognition studies, we do not have a specific gesture type that appears per emotion. A large variety of gestures appear spontaneously, but many times there are no gestures or they are not visible, and speech is spontaneous and sometimes overlapping. Our contribution is a novel two-stage stress prediction model from low level features using an intermediate level representation of gestures and speech.

III. METHODOLOGY

In this section we present the methodology for our research. Since our final goal is to develop an automatic system that is able to assess stress, we regard stress from two perspectives. The first perspective is that of human expression and perception of stress. In this case the focus is on how people convey and perceive stress and which communication channels they use, with an emphasis on speech and gestures. The second perspective is that of an automatic stress recognition system based on sound and vision, and the available low level features. The components of our model for automatic stress assessment are inspired by the human model. Our analysis is performed from the perspective of automatic surveillance. In that sense, by stress we refer to the global stress perceived in the scene and not to the stress of one person. In the study based on the human model, as well as in the study related to automatic stress assessment, we use footage from one audio-visual camera and consider all audible speech and all visible gestures.

A. Model of stress expression and perception using speech and gestures from a human perspective

The model we present in this section is a basic model that is not meant to be complete or novel. Rather, it has the goal and function of enabling us to operationalize phenomena about stress that relate to gestures and speech.

Following [40] and [24], we use the term semantics to denote the information that contributes to the utterance's intended meaning. Following [20], we use the term modulation to denote the part of the message that is transmitted by how the semantic message is delivered.

For both speech and gestures, we observe that there are two ways in which they can communicate stress, which refer to what is being communicated (the semantic message) and how the message is being communicated (modulation). For example, speech can convey stress by the meaning of the spoken words (e.g. “I am nearly missing the flight.”), but also by the way in which the utterance was voiced (e.g. someone speaks loudly, fast, with a high pitch). In analogy to speech, gestures can also communicate stress in two ways: by the common meaning of the gesture, i.e. what the gesture is saying (e.g. a pointing-to-self sign), and by how the gesture is being done (e.g. the gesture is sudden, tense or repetitive).

Fig. 1. Human model of conveying and perceiving stress by the components of speech and gestures.

As illustrated in Figure 1, the proposed model consists of four components:

• Speech Semantics Stress. The extent to which stress is conveyed by the spoken words (the linguistic component).

• Speech Modulation Stress. The extent to which stress is conveyed by the way the message was spoken (the paralinguistic component of conveying stress).

• Gesture Semantics Stress. The extent to which a gesture conveys stress based on its meaning, i.e. the interpretation of the sign.

• Gesture Modulation Stress. The extent to which a gesture conveys stress based on the way it is done, i.e. the rhythm, speed, jerk, expansion, and tension of the gesture.

We are interested in how stress is conveyed by these four means of communication. In section VII we explore whether any of them is dominant, how they are correlated, and how well their annotated labels perform in predicting stress.

B. Model for automatic stress assessment using intermediate level variables

As stated in the introduction, we are interested in which components of speech and gestures are good cues when it comes to assessing stress. Starting from the human model of perceiving stress introduced in section III-A, we propose a three level architecture for automatic stress assessment.

The automatic stress assessment model is depicted in Figure 2. The low level consists of sensor features, the intermediate level of variables related to stress inspired by the human model, and the last level of the final stress assessment.

By using a classifier and the labels for the stress variable, we can compute the relation between the low level acoustic and video features and the high level stress variable. This enables us to automatically compute the stress level based on low level features. We refer to this method as the baseline model.


A drawback of the baseline model is the semantic gap between the low level features and the high level interpretation. To improve the baseline automated model, we defined variables on the intermediate level. There is a difference between the acoustic and video features from the low level and the variables in the intermediate level. The acoustic and video features are computed automatically but they have no intrinsic meaning. The intermediate level variables are automatically computed as well, yet in contrast to the low level features they do have a meaning with respect to stress.

Fig. 2. Automatic model for assessing stress using low level features and intermediate level variables.

A mapping between the human model variables in Figure 1 and the intermediate level variables in Figure 2 has been indicated by colors. We expect that the stress expressed by Speech Modulation and Gesture Modulation (yellow) can be estimated from low level features, and therefore there is a direct mapping of these two variables across the two models. Speech Semantics Stress and Gesture Semantics Stress (purple) are more complex semantic variables, and therefore in the automatic model we propose to operationalize them by valence (ranging from positive to negative), arousal (ranging from passive to active) and a limited set of topics (content dependent clusters of words or gestures). Valence and arousal measure emotion in a continuous manner, so we expect that they can also be an indication of the stress level. The chosen speech and gesture topics are context dependent; they contain semantic information on the amount of stress but also give qualitative insight into the type or cause of stress. The bar on the right side of the figure shows the stress level ranging from no stress (green) to very high stress (red).

In the next paragraphs we give a flavor of how stress prediction using our model with intermediate level variables works. More details about the intermediate level variables and how the ground truth was established for each of them are provided in section V, while the experimental setup, including the low level features, is described in section VI.

The intermediate level variables are computed automatically, with ground truths based on human annotations. For Speech Modulation Stress and Gesture Modulation Stress we use the annotations from the testing of the human model. For speech valence and arousal, we use the ratings in the ANEW database [6], for which many respondents were asked to rate words on a valence and an arousal scale. We compute valence and arousal scores based on the spoken words and their occurrences and values in the ANEW list. Note that the valence and arousal scores are based on human semantic interpretation, which is an added value over the low level features. Another option would have been to compute the valence and arousal variables based on the acoustics of speech, as was done in [19], but in our research we prefer to use the acoustics of speech for Speech Modulation Stress and the words for valence and arousal. The gesture valence and arousal are also based on human annotations. They are based on annotations of our recordings, because an equivalent of ANEW for gestures is not available. The word topics and gesture topics are chosen based on key topics that are likely to appear during stressful interactions at a service desk, so they are context dependent. The word topics are computed based on appearances of specific words, and the gesture topics result from a human clustering of the gesture classes available in the service desk corpus.

The low level variables consist of features extracted from the audio and video streams. From speech we extract acoustic features, e.g. fundamental frequency (F0), intensity and voice quality features. In a fully automatic system, we would use a speech recognition system to obtain the spoken words. For the purpose of this paper we assume perfect recognition and use manual transcriptions of the words instead. From the video stream we extract features related to movement and appearance. Ideally, there would be a symmetry between audio and video, and a gesture recognition unit would output a gesture type for which we would have valence and arousal scores as well as an associated meaning. Although gesture recognition is actively studied, to the best of our knowledge there is no gesture recognition module available to distinguish between the subtle and complex variations that appear when stress is conveyed. Therefore, we use the low level video features to predict the gesture-related intermediate level variables.
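To make the two-stage idea concrete, the sketch below shows one possible wiring of such a pipeline in Python with scikit-learn: a first stage of classifiers maps low level features to the intermediate variables (speech-related ones from acoustic features, gesture-related ones from video features), and a second stage maps the predicted intermediate values to the stress label. This is only an illustrative assumption about the plumbing: a Random Forest stands in for the Bayes Net classifier used in our experiments, and the variable names, feature dimensions and naming convention ("speech..." vs. other keys) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_two_stage(X_acoustic, X_video, y_intermediate, y_stress):
    """Stage 1: low level features -> intermediate variables.
    Stage 2: predicted intermediate variables -> stress label.
    y_intermediate maps a variable name (e.g. "speech_modulation") to its labels."""
    stage1 = {}
    for name, y in y_intermediate.items():
        # Speech-related variables are predicted from acoustic features,
        # gesture-related ones from video features (cf. Figure 2).
        X = X_acoustic if name.startswith("speech") else X_video
        stage1[name] = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    Z = np.column_stack([
        stage1[name].predict(X_acoustic if name.startswith("speech") else X_video)
        for name in sorted(stage1)
    ])
    stage2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(Z, y_stress)
    return stage1, stage2

def predict_stress(stage1, stage2, X_acoustic, X_video):
    """Apply both stages to new segments and return the predicted stress labels."""
    Z = np.column_stack([
        stage1[name].predict(X_acoustic if name.startswith("speech") else X_video)
        for name in sorted(stage1)
    ])
    return stage2.predict(Z)
```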

Below we give an account of the chosen variables that form the intermediate level:

• Speech Modulation Stress. The extent to which speech conveys stress by how it sounds (e.g. prosody). This variable was annotated on a 3 point scale and is automatically predicted from the acoustic features.

• Speech Valence. How positive or negative the spoken words are. The value of this variable is computed based on the valence values of words in the ANEW list.

• Speech Arousal. How active or passive the spoken words are. The value of this variable is computed based on the arousal values of words in the ANEW list.

• Speech Topics. Based on keywords, we have created classes of words which have a relation to the level or the source of stress. Examples of speech topics are being helpless, late, aggressive and insulting.

• Gesture Modulation Stress. The extent to which the manner in which the gesture was performed conveys stress. This variable was annotated on a 3 point scale and is automatically predicted from the video features.

• Gesture Valence. We create a ground truth based on the valence annotation available in the database for gesture instances of each class. This variable is predicted from video features.

• Gesture Arousal. We create a ground truth based on the arousal annotation available in the database for gesture instances of each class. This variable is predicted from video features.

• Gesture Topics. We provide a clustering based on the 60 classes of gestures available in the dataset annotation. We define groups of gestures which have a semantic meaning with respect to stress, which we call gesture topics. Examples of gesture topics are: explaining gestures, inner stress gestures, extrovert stress gestures and aggressive gestures. This variable is predicted from video features.

The idea behind this model was inspired by the large variability of manners in which people can express stress just by their voice and hand gestures, which makes the task of automatically assessing stress challenging. Furthermore, valence and arousal are entities used in many emotion related applications and we expect that especially their combination correlates well with stress.

IV. DATASET OF HUMAN-HUMAN INTERACTION AT A SERVICE DESK

In order to test our stress models we validate them using our corpus of audio-visual recordings at a service desk, introduced in [29]. The dataset has been specifically designed for surveillance purposes. It contains improvised interactions of actors who only received roles and short scenario descriptions. The dataset is rich and includes various manifestations of stress. However, the actors did not receive any indication at all to encourage them to use their voice or body language in any particular way.

There have been debates on the use of acted corpora in emotion recognition related research. We argue that the use of actors is suitable for our purposes, due to the following considerations. Real-life emotions are rare, very short, and constantly manipulated by people due to their desire to follow their strategic interests and social norms. Eliciting stress and negative emotions is not considered ethical. Furthermore, the behavior of people in a real stressful situation will be influenced when they know that they are being recorded, which can also be considered a form of acting. In [37] a distinction is made between push (physiologically driven) and pull (social regulation and strategic intention) factors of emotional expression. In interactions such as the ones from our scenarios at the service desk, we expect pull factors to play an important role. Since our aim is to study overt manifestations from the perspective of surveillance applications, we consider the use of actors appropriate. Even though actors are employed, the procedure was carefully designed to generate spontaneous interactions between them.

In the remainder of this section we summarize the content, annotations and segmentations of the dataset, as they will be used in the experiments.

A. Content description

The audio-visual recordings contain interactions improvised by eight actors. They had to play the roles of service desk employees and customers given only the role descriptions and short instructions. Four scenarios were played two times, resulting in eight sessions, for which the actors did not see the performance of their colleagues from the other session. This resulted in realistic recordings, since the interaction between the actors built up as a result of the other's reactions, which were not known beforehand. Example scenarios are: a visitor who is late for a meeting has to deal with a slow employee; a helpless visitor unable to find a location on a map asks the employee to be escorted but is refused; the service desk employee does not want to help because of his lunch break; and the employee or a visitor is in a phone conversation and blocking the service desk.

Two cameras were used to record the interaction, and in our research we use the camera facing the visitor, since we expect that most of the time the visitor is the one experiencing stress. However, it is sometimes the case that the employee is also visible in the camera facing the visitor. In cases when the employee is gesturing as well and this is visible, his or her gestures are also considered in our research. The total recordings span 32 minutes.

B. Annotations and segmentation

The data as presented in [29] has annotations for the stress level by two annotators, utterance and gesture segmentations, speech transcriptions, as well as annotations of other dimensions such as valence and arousal of gestures, and classes of gestures. In addition to what was already available, in this paper we have added two more annotators for the stress level, resulting in an inter-rater agreement of 0.75 measured as Krippendorff's alpha. For segments that received different stress levels, the annotators had to review them and come to a joint decision. Furthermore, we have annotated the degree of stress expressed by the four communication components proposed in section III-A of this paper. These annotations were performed by one expert annotator and checked by a second one. Whenever there was a disagreement, the final ground truth was set after discussion. For more details on the database and its annotations we refer to [29].

The utterance and gesture segmentations were done using the following procedure. As a basis, speech was segmented into utterances. First, borders were chosen based on turn taking. Whenever a turn was too long (i.e. longer than 10 seconds), it was split based on pauses. Separately, we segmented the gestures that appear in the dataset. In general, we regard a gesture as one instance of a gesture class. A gesture class refers to gestures that have the same meaning and appearance. In the case of isolated gestures (i.e. not connected to other gestures or movements), the borders of the gestures were chosen by the annotators such that the segment includes the onset, apex and offset of the gesture. Gestures which include a repetitive movement, e.g. tapping of fingers, were regarded as a whole, so there were no additional segments for each tapping movement. If a repetitive gesture was longer than 10 seconds, it was split into shorter segments. The two segmentations and the annotation are outlined in Figure 3. The figure also shows the annotation software Anvil [22] used by the raters.

Fig. 3. Annotations for stress, Speech Semantics Stress, Speech Modulation Stress, Gesture Semantics Stress and Gesture Modulation Stress in Anvil [22].

The annotations used in this paper are:

• Stress. The perceived stress level in the scene was annotated on a 5 point scale based on multimodal assessment at the utterance level. In our experiments we simplify the problem to a 3 point scale, where classes 2 and 3 are treated as a class of moderate stress and classes 4 and 5 as high stress.

• Speech Semantics Stress. The extent to which stress is conveyed by the spoken words, annotated on a three point scale, at the utterance level.

• Speech Modulation Stress. The extent to which stress is conveyed by the way the message was spoken, annotated on a three point scale, at the utterance level.

• Gesture Semantics Stress. The extent to which a gesture conveys stress based on its meaning, annotated on a three point scale, at the gesture level.

• Gesture Modulation Stress. The extent to which a gesture conveys stress based on the way it is done, annotated on a three point scale, at the gesture level.

• Text transcriptions. The text spoken by the actors has been transcribed. Besides the original words, several other sounds have been annotated: sighs, overlapping speech and rings from the bell available at the desk. The transcriptions are done at the utterance level.

• Valence and arousal of gestures. In [29] we identified 60 classes of gestures that have the same meaning and appearance. One representative instance of each class was annotated for valence and activation using the self assessment manikin [6]. The annotations of the selected gestures for each class were performed by 8 raters using only the video (no sound) and by another 8 raters in a multimodal setup. In this work we use the average annotation of all 16 raters. The granularity for these annotations was a 9 point scale, but we observed that almost all the data falls into the positive arousal and negative valence quadrant. To simplify the problem, we map to neutral the few labels for negative arousal and positive valence. We end up with a 5 point scale for each dimension, which we reduce to a 3 point scale by coupling labels 2 and 3 together, and 4 and 5 together (see the label-mapping sketch after this list). The meaning of the new labels is: 1 - neutral arousal, neutral valence; 2 - medium arousal, medium negative valence; and 3 - high arousal and strongly negative valence.

• Gesture class. The 60 classes of gestures that have the same meaning and appearance found in [29] are used in this paper for gathering the Gesture Topics.
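As a concrete illustration of the label coarsening described above, the snippet below maps the 5 point stress annotation and a 9 point SAM arousal rating to the 3 point scales used in the experiments. The exact treatment of the neutral half of the SAM scale (here: everything at or below the midpoint of 5 collapses to label 1, and 6-9 become the 5 point labels 2-5) is an assumed reading of the procedure, shown only for illustration; for valence the mapping mirrors this by collapsing the positive half instead.

```python
def coarsen_stress(label_5pt: int) -> int:
    """5 point stress -> 3 point scale: 1 -> no stress, {2,3} -> moderate, {4,5} -> high."""
    if label_5pt <= 1:
        return 1
    return 2 if label_5pt <= 3 else 3

def coarsen_arousal(rating_9pt: float) -> int:
    """9 point SAM arousal -> 3 point scale via the intermediate 5 point step.
    Assumption: ratings <= 5 collapse to neutral (1); 6..9 map to 5 point labels 2..5,
    which are then coupled as {2,3} -> 2 and {4,5} -> 3."""
    five_pt = 1 if rating_9pt <= 5 else int(round(rating_9pt)) - 4
    return coarsen_stress(five_pt)
```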

V. INTERMEDIATE LEVEL FEATURES

We continue with a description of the intermediate level variables from our model of automatic stress assessment, which were introduced in section III-B and Figure 2. We can divide them into two categories: variables for which we have an annotated ground truth and which we predict from low level visual and acoustic features (to be discussed in section VI-A), and variables for which there is no ground truth available and for which we compute values based on simple algorithms. The prediction methodology is presented in section VIII-B. For both speech and gestures, given the complexity of semantic messages, the Semantic component was operationalized by three entities: Valence, Arousal and Topics.

A. Speech Modulation Stress and Gesture Modulation Stress

Speech Modulation Stress and Gesture Modulation Stress are the two variables which have been maintained from our human model. As a ground truth we use the dataset annotations described in section IV-B. Speech Modulation Stress is predicted from the acoustic features, and Gesture Modulation Stress is predicted from the visual features.

B. Speech Valence and Speech Arousal

Speech Valence and Speech Arousal have no ground truths available. Our approach is to use a simple technique based on previous work for creating lists of words with valence and arousal scores. The Dictionary of Affect in Language (DAL) [42] and the Affective Norms for English Words (ANEW) [6] were the considered options. We use ANEW because it contains fewer, but more emotion related, words.

Since a significant part of the transcriptions is in Dutch, a first step was to use machine translation to obtain the English version. To maximize the matching between our words and the words from the ANEW list, we applied stemming. The score of each utterance is initialized to neutral valence and arousal values. All the words from the utterance are looked up in the ANEW list. If matches are found, a new score is computed by averaging over the valence and arousal scores of the matching words. We find that 23% of the utterances contain words from the ANEW list; therefore the majority of the scores still indicate neutral valence and arousal.
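A minimal sketch of this scoring step is shown below, assuming the ANEW lexicon has already been loaded into a dictionary mapping a stemmed word to its (valence, arousal) norms. The lexicon loading, the neutral default of 5.0 on the 1-9 ANEW scale, and the use of NLTK's Snowball stemmer are assumptions made for illustration.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def utterance_valence_arousal(words, anew, neutral=5.0):
    """Average ANEW valence/arousal over the words of one (translated) utterance.

    words : list of English tokens for the utterance
    anew  : dict mapping a stemmed word -> (valence, arousal) norms
    Returns (valence, arousal); utterances with no ANEW match stay neutral.
    """
    stems = [stemmer.stem(w.lower()) for w in words]
    matches = [anew[s] for s in stems if s in anew]
    if not matches:
        return neutral, neutral
    valence = sum(v for v, _ in matches) / len(matches)
    arousal = sum(a for _, a in matches) / len(matches)
    return valence, arousal

# Example with hypothetical ANEW entries:
# anew = {"angri": (2.85, 7.17), "help": (7.20, 4.50)}
# utterance_valence_arousal("I am so angry".split(), anew)
```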

C. Word Topics

Since we consider the service desk domain, we found that there are at least a number of topics which are likely to come up during visitor-employee interactions at the service desk and which might be indications of stress. We defined five classes of words indicating typical problems, and we added the three nonverbal sounds which are annotated in the textual transcription. Together, they result in eight topics:

• Being late. Many times visitors at a service desk are stressed because they are late for a meeting. Keyword examples: 'late', 'wait', 'hurry', 'quickly', 'immediately'.

• Helpless. This indicates that visitors have difficulties finding what they need. Keywords include: 'no idea', 'need', 'help', 'difficult', 'find', 'problem'.

• Dissatisfied. Keyword examples: 'annoying', 'unkind', 'manager', 'rights', 'ridiculous', 'quit'.

• Insults. Keywords include curse and offensive words.

• Aggressive. Keyword examples: 'attack', 'sweat', 'guard', 'touched', 'hurts', 'push', 'police', 'control'.

• Ring. Annotated whenever someone rang the bell available at the service desk.

• Overlap. Annotated when there was overlapping speech.

• Sigh. Sighs of the actors were also annotated.

Based on these topics, an utterance is described by an 8-dimensional feature vector containing counts of keywords from each topic. The number of occurrences of words from each topic already gives an impression of the content of the data. The most frequent topics are Late (20% of the utterances), Helpless (14%) and Dissatisfied (27%). Insults (2%) and Aggressive (4%) are less frequent. Among the nonverbal indicators, the most frequent is Overlap (9%), followed by Sigh (3%) and Ring (2%).
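The counting itself is straightforward; the sketch below builds the 8-dimensional count vector for one utterance. The keyword sets shown are abbreviated single-word examples taken from the descriptions above (multi-word keys such as 'no idea' are omitted for simplicity), not the full lists used in the experiments, and the 'insults' entry is a placeholder.

```python
TOPIC_KEYWORDS = {
    "late":         {"late", "wait", "hurry", "quickly", "immediately"},
    "helpless":     {"need", "help", "difficult", "find", "problem"},
    "dissatisfied": {"annoying", "unkind", "manager", "rights", "ridiculous", "quit"},
    "insults":      {"idiot"},          # placeholder; the real list contains curse words
    "aggressive":   {"attack", "sweat", "guard", "push", "police"},
}
NONVERBAL = ["ring", "overlap", "sigh"]  # annotated directly in the transcription

def topic_features(tokens, nonverbal_tags):
    """Return the 8-dimensional topic count vector for one utterance.

    tokens        : list of lower-cased transcribed words
    nonverbal_tags: list of nonverbal annotations for the utterance, e.g. ["sigh"]
    """
    counts = [sum(t in keywords for t in tokens) for keywords in TOPIC_KEYWORDS.values()]
    counts += [nonverbal_tags.count(tag) for tag in NONVERBAL]
    return counts

# Example: topic_features("i need help quickly".split(), ["sigh"])
# -> [1, 2, 0, 0, 0, 0, 0, 1]
```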

D. Gesture Valence and Gesture Arousal

For Gesture Valence and Gesture Arousal we use the existing annotations for the representative gestures chosen from each class (recall section IV-B). Even though the valence and arousal of a gesture class might vary due to its modulation, we extend the label of one gesture class to all gestures in that class. The distribution of the resulting labels for arousal is 14% for class 1 (neutral), 61% for class 2 (medium) and 25% for class 3 (high arousal). For valence, the proportion of the middle class is even higher, 69%, and the rest consists of 19% class 1 (neutral) and 12% class 3 (very negative).

E. Gesture topics

We want to recognize gestures for the purpose of stress prediction. Ideally, we would be able to automatically recognize all the gesture classes which were identified in the dataset. However, for many gesture classes there are very few or even just one example available. Taking into account our final application, identifying a smaller number of general categories of gestures might still be sufficient for stress prediction. Our approach was therefore to create a manual clustering of the 60 gesture classes into 6 classes, based on their meaning. We call each newly formed class a gesture topic. All the gestures which were assigned the same topic receive the same new label, and our problem transforms into recognizing to which topic a gesture belongs.

The six topics are chosen by studying the application domain and the 60 classes that were already found. We define the following 6 topics, and examples from each of them are depicted in Figure 4:

• Explaining. Gestures used to visually complement a linguistic description, explaining a concept, pointing directions, pointing to self and to others.

• Batons. As defined in [10], batons are gestures that accent or emphasize particular words or phrases. They are simple, repetitive, rhythmic movements with no clear relation to the semantic content of speech.

• Internal stress. Gestures that point to stress but without any signs of aggression or tendency to react. Example gestures are putting a hand on the forehead, wiping away sweat, fidgeting, and putting hands close to the mouth or nose. This is the category in which self-adaptors (movements which often involve self touch, such as scratching) occur. Self-adaptors are known to be indicative of stress. In [35] the effects of self-adaptors as well as other language and gesture aspects on perceived emotional stability and extroversion are studied.

• Extrovert low. Gestures that show stress and a tendency to make it visible to the opponent, e.g. tapping fingers or patting a palm on the desk, or pointing to a watch.

• Extrovert high. Gestures that are clear indications of high stress, one step before becoming aggressive, for example slamming fists on the desk, or wild movements of the hands showing a lot of dissatisfaction.

• Aggressive. Gestures which clearly show aggression, like pushing somebody, slamming objects, or throwing objects.

While we do not claim that these gesture topics provide a precise understanding of what a gesture means, we argue that an increase in stress probability can be observed from the first to the last gesture topic. We think these gesture topics offer a good coverage of the gesture types occurring in the database, and find them interesting candidates for stress prediction.
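Operationally, the clustering is just a fixed mapping from the 60 annotated gesture classes to the 6 topics; the sketch below shows the relabeling step. The class names are invented placeholders, since the full class list from [29] is not reproduced here.

```python
# Hypothetical excerpt of the manual class -> topic mapping (the real mapping covers 60 classes).
CLASS_TO_TOPIC = {
    "point_direction":  "explaining",
    "point_to_self":    "explaining",
    "beat_with_hand":   "batons",
    "hand_on_forehead": "internal_stress",
    "tap_fingers_desk": "extrovert_low",
    "slam_fist_desk":   "extrovert_high",
    "push_person":      "aggressive",
}

def gesture_topic_labels(gesture_classes):
    """Relabel per-segment gesture class annotations with their gesture topic."""
    return [CLASS_TO_TOPIC[c] for c in gesture_classes]
```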

VI. DATA PROCESSING APPROACH

This section provides details on the experimental setup. It contains a description of the low level features used for automatic stress assessment, as well as information about the segmentation and classification procedure, which are relevant both for the automatic stress assessment and for the analysis, based on annotations, of how the speech and gesture components of our human model convey stress.

A. Low level feature extraction

As described in section III-B, the acoustic and visual low level features are used to predict the intermediate variables and, alternatively, the stress level directly.

Fig. 4. Examples of gestures from each gesture topic.

1) Acoustic features: The acoustic features capture the nonverbal part of speech, namely whether the manner in which somebody is speaking can convey signs of stress. Vocal manifestations of stress are dominated by negative emotions such as anger, fear, and stress. For the audio recognizers we use a set of prosodic features inspired by the minimum required feature set for emotion recognition proposed by [18] and the set proposed in [33]. The software tool Praat [4] was used to extract these features. Note that there are many more interesting features that can improve recognition accuracies. Example features that can be added are roughness, mode, MFCCs, and spectral flux, which can be extracted using the MIR toolbox [27], or feature sets from the INTERSPEECH paralinguistic challenge [39]. The goal of the paper is to explore the merit of an intermediate-level representation, and to assess this by verifying the relative improvement. Absolute performance can potentially be improved by adding more features, which is accommodated by the proposed framework and representation.

The hand-crafted feature set consists of the following features: speech duration without silences (for each utterance the speech part was separated from silences using an algorithm available in Praat [4]); pitch (mean, standard deviation, max, mean slope with and without octave jumps, and range); intensity (mean, standard deviation, max, slope and range); the first four formants (F1-F4) (mean and bandwidth); jitter; shimmer; high frequency energy (HF500, HF1000); harmonics-to-noise ratio (HNR) (mean and standard deviation); Hammarberg index; spectrum (center of gravity, skewness); and long term averaged spectrum (slope). As unit of analysis, we used the utterance segmentation.
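As an illustration of how a few of these measurements can be obtained programmatically, the sketch below extracts basic pitch and intensity statistics for one utterance using the praat-parselmouth Python bindings to Praat. This is an assumption about tooling (our extraction used Praat directly) and covers only a small subset of the listed features.

```python
import numpy as np
import parselmouth

def basic_prosody(wav_path):
    """A few utterance-level pitch and intensity statistics via Praat/parselmouth."""
    snd = parselmouth.Sound(wav_path)

    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                       # keep voiced frames only

    intensity = snd.to_intensity()
    energy = intensity.values.flatten()

    return {
        "f0_mean":  float(np.mean(f0)) if f0.size else 0.0,
        "f0_std":   float(np.std(f0)) if f0.size else 0.0,
        "f0_max":   float(np.max(f0)) if f0.size else 0.0,
        "f0_range": float(np.ptp(f0)) if f0.size else 0.0,
        "intensity_mean": float(np.mean(energy)),
        "intensity_max":  float(np.max(energy)),
    }
```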

2) Visual features: We expect that the most relevant features for stress detection are based on movement. We chose to describe the video segments in terms of space-time interest points (STIP) [26], which are compact representations of the parts of the scene that are in motion. Originally these features were employed for action recognition. They were successfully used for recognizing 48 actions in [7], while in [30] they were successfully applied to recognizing degrees of aggression. More advanced features, such as the trajectory based features used for genre detection, can be included in the framework, but the purpose of this paper is to evaluate the use of the intermediate representation.

The space-time interest points are computed for a fixed set of multiple spatio-temporal scales. For the patch corresponding to each interest point, two types of descriptors are computed: histograms of oriented gradients (HOG) for appearance, and histograms of optical flow (HOF) for movement. For feature extraction we used the software provided by the authors [26].

These descriptors are used in a bag-of-words approach, following [26]. We computed specialized codebooks, but instead of using K-means as in the original paper, the codebooks were computed in a supervised way using Random Forests with 30 trees and 32 nodes. We used the gesture segmentation provided in the dataset. The visual intermediate level features were predicted only for data for which gestures were available. The resulting feature vectors were reduced using correlation based feature subset selection.
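The sketch below illustrates one way such a supervised bag-of-words encoding can be realized with scikit-learn: a small Random Forest is fitted on the STIP descriptors, each descriptor inheriting the label of the gesture segment it came from, and every segment is then represented by the relative frequency of the forest leaves reached by its descriptors. This is an illustrative interpretation of a supervised codebook, not necessarily the exact construction used in our experiments, and descriptor loading is assumed to happen elsewhere.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_codebook(descriptors, descriptor_labels, n_trees=30, n_leaves=32):
    """Supervised codebook: a small Random Forest fitted on HOG/HOF descriptors."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_leaf_nodes=n_leaves, random_state=0
    )
    forest.fit(descriptors, descriptor_labels)
    return forest

def encode_segment(forest, segment_descriptors):
    """Bag-of-words vector for one gesture segment: per-tree leaf frequencies, concatenated."""
    leaves = forest.apply(segment_descriptors)   # shape: (n_descriptors, n_trees)
    parts = []
    for tree_idx, tree in enumerate(forest.estimators_):
        counts = np.bincount(leaves[:, tree_idx], minlength=tree.tree_.node_count)
        parts.append(counts / len(segment_descriptors))
    return np.concatenate(parts)
```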

B. Classification

All classification tasks were performed using a Bayes Net (BN) classifier. Each prediction task was treated as a set of binary one-against-all problems. The final prediction label was the one corresponding to the maximum posterior.

Due to the nature of human communication, and our utterance level and gesture level segmentations, there are data segments for which there are no gestures. These segments were handled by the Bayes Net classifier as missing data.

For almost all our prediction tasks there was an imbalance with respect to the number of samples per class. Therefore, we tested the benefits of applying the Synthetic Minority Over-sampling Technique (SMOTE) [5] to balance the data. The artificial samples were created for the training set only, and the proportion to be created was chosen such that the number of samples in the minority class is at least 70% of the number of samples in the majority class.
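A minimal sketch of this balancing step is given below, using the SMOTE implementation from the imbalanced-learn package as a stand-in for the implementation used in our experiments. Oversampling is applied only to the training portion of each cross-validation fold, and the 0.7 ratio mirrors the 70% criterion above (valid for the binary one-against-all tasks).

```python
from imblearn.over_sampling import SMOTE

def balance_training_fold(X_train, y_train, ratio=0.7, seed=0):
    """Oversample the minority class of a binary training fold with SMOTE.

    ratio=0.7 asks for the minority class to reach 70% of the majority class size.
    The corresponding test fold is left untouched.
    """
    smote = SMOTE(sampling_strategy=ratio, random_state=seed)
    return smote.fit_resample(X_train, y_train)
```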

As fusion methods, we experiment with feature level and decision level fusion. Feature level fusion (FLF) means that the feature vector is formed by concatenating the low level acoustic and video features. In decision level fusion (DLF) there are two stages of classification. In the first stage the acoustic and video low level features are used separately to predict a specific ground truth. In the second stage, the scores obtained from the first stage are used as features to predict the stress label.
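The sketch below contrasts the two baselines under the same illustrative assumptions as before (scikit-learn Random Forests standing in for the Bayes Net classifier): FLF concatenates the modality features, while DLF stacks per-modality class scores into a second-stage classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feature_level_fusion(Xa_tr, Xv_tr, y_tr, Xa_te, Xv_te):
    """FLF: concatenate acoustic and video features and train a single classifier."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(np.hstack([Xa_tr, Xv_tr]), y_tr)
    return clf.predict(np.hstack([Xa_te, Xv_te]))

def decision_level_fusion(Xa_tr, Xv_tr, y_tr, Xa_te, Xv_te):
    """DLF: per-modality classifiers first; their class scores feed a second-stage classifier.
    (In practice the second stage is better trained on out-of-fold scores to avoid bias.)"""
    audio = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xa_tr, y_tr)
    video = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xv_tr, y_tr)

    scores_tr = np.hstack([audio.predict_proba(Xa_tr), video.predict_proba(Xv_tr)])
    scores_te = np.hstack([audio.predict_proba(Xa_te), video.predict_proba(Xv_te)])

    fusion = RandomForestClassifier(n_estimators=100, random_state=0).fit(scores_tr, y_tr)
    return fusion.predict(scores_te)
```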

All experiments were performed in a leave-one-session-out (LOSO) cross-validation framework. As performance measures, we report weighted (WA) and unweighted (UA) accuracies, to take the data imbalance into account (recall Figure 6). Confusion matrices are used to give a more detailed view of the performance of the classification tasks. Given the data imbalance and the fact that high recognition rates are desired for all classes, the unweighted average (UA) is considered the main performance measure.
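A sketch of this evaluation loop is given below, assuming a per-segment array of session identifiers is available to define the leave-one-session-out folds; scikit-learn's LeaveOneGroupOut provides the splits. Under the usual reading of these measures, accuracy_score corresponds to WA and balanced_accuracy_score (mean per-class recall) to UA; the classifier is again an illustrative stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(X, y, sessions):
    """Leave-one-session-out evaluation reporting weighted (WA) and unweighted (UA) accuracy."""
    y_true, y_pred = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        y_pred.extend(clf.predict(X[test_idx]))
        y_true.extend(y[test_idx])
    wa = accuracy_score(y_true, y_pred)             # weighted accuracy (overall)
    ua = balanced_accuracy_score(y_true, y_pred)    # unweighted accuracy (mean per-class recall)
    return wa, ua
```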

C. Segmentation

The database contains two segmentations: one based on utterances, which was used for the Stress and Speech Modulation Stress annotations, and one based on gestures. Many times the boundaries of utterances and gestures do not coincide.

For prediction tasks involving both speech and gesture related variables, we adopt a new segmentation. For every utterance and gesture border, there is a border in the new segmentation. The values of the variables are transferred based on timing. Take for example the case of two utterances with labels u1 and u2, and a gesture with label g1 starting within the first utterance and ending within the second utterance. In the new segmentation, there will be two segments with label u1 separated where the gesture began, two segments with label u2 separated where the gesture ended, and two gesture segments both with label g1. This new segmentation was used only for the experiments that involved multimodal assessments of stress, and it was combined with a LOSO cross-validation scheme. In total 1066 samples were used for classification, out of which 486 contained gestures.
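The boundary merging can be expressed compactly; the sketch below builds the union of utterance and gesture borders and assigns each new segment the utterance and gesture labels active at its midpoint (the midpoint rule is an assumed detail of the label transfer, used here only for illustration).

```python
def merge_segmentations(utterances, gestures):
    """Merge utterance and gesture segmentations into one set of segments.

    utterances, gestures : lists of (start, end, label) tuples, times in seconds
    Returns a list of (start, end, utterance_label, gesture_label), where the
    gesture label is None if no gesture covers the segment.
    """
    borders = sorted({t for s, e, _ in utterances + gestures for t in (s, e)})

    def label_at(segments, t):
        for s, e, lab in segments:
            if s <= t < e:
                return lab
        return None

    merged = []
    for start, end in zip(borders[:-1], borders[1:]):
        mid = (start + end) / 2.0
        merged.append((start, end, label_at(utterances, mid), label_at(gestures, mid)))
    return merged

# Example from the text: two utterances u1, u2 and a gesture g1 spanning both
# merge_segmentations([(0, 4, "u1"), (4, 8, "u2")], [(2, 6, "g1")])
# -> [(0, 2, 'u1', None), (2, 4, 'u1', 'g1'), (4, 6, 'u2', 'g1'), (6, 8, 'u2', None)]
```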

Fig. 5. The figure shows how a new segmentation was obtained based on the initial utterance and gesture segmentations.

VII. ANALYSIS OF HOW SPEECH AND GESTURES CONVEY STRESS BASED ON THE HUMAN MODEL

In this section we explore how stress was expressed by the communication components defined in section III-A. The aim is to understand which communication channels are dominant in conveying stress, as we expect these to also be the most relevant for automatic prediction. For this task we use the labels described in section IV-B and the new segmentation explained in section VI-C.

A. Analysis of the correlations between the communication components and stress

As a first step in our study, we are interested in the correlation coefficients between the stress labels and the four annotated communication components. The annotated variables are ordinal and therefore we use Spearman correlation coefficients. Since some of the communication components are gesture related and were annotated only when gestures appeared, we compute correlations for the variables which were available for all data, and separately for all variables on the gesture segments only. There were no significant changes in the correlation coefficients of the variables available for all data when they were computed only on the gesture segments. Therefore, in Table I we show the correlations for all variables on the data segments that did contain gestures.

TABLE I
CORRELATION COEFFICIENTS (SPEARMAN) BETWEEN STRESS AND THE PROPOSED COMMUNICATION COMPONENTS. SSS = SPEECH SEMANTICS STRESS, SMS = SPEECH MODULATION STRESS, GSS = GESTURE SEMANTICS STRESS AND GMS = GESTURE MODULATION STRESS.

         Stress   SSS    SMS    GSS    GMS
Stress     1
SSS       .76      1
SMS       .46     .53     1
GSS       .52     .48    .25     1
GMS       .15     .10    .10    .29     1

Stress is most correlated with Speech Modulation Stress, followed by Speech Semantics Stress and Gesture Modulation Stress. A lower correlation coefficient is observed between Stress and Gesture Semantics Stress. Therefore, we expect the first three to be good measures for predicting stress. Gesture Semantics Stress was the least correlated with Stress. This implies that when gestures are used during stress, their modulation gives more indication of the emotional state than their semantics. Examples of gestures in this database that fall into this category are pointing gestures, which have semantically no relation to stress, but which by their modulation can indicate stress. Apparently gestures conveying stress by their semantics, such as insulting gestures, tapping or slamming hands on the desk, or gestures with an aggressive tendency, are less frequent.

We also computed the correlations between the communication components themselves. The two speech related components are highly correlated, as are SMS and GMS. From these values we expect that when the semantics of speech conveys more stress, this will also be noticeable from the speech modulation. Also, when speech modulation indicates more stress, it is likely to be accompanied by gestures that also show more stress. One cause for the lower correlation between Stress and the gesture variables can be the fact that stress was assessed as a general measure of the scene, and not per person. This can lead to lower correlations when one actor is visible and the other one is speaking.
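For completeness, the pairwise coefficients reported in Table I can be reproduced from the per-segment annotation columns with SciPy; the variable names below are illustrative.

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_spearman(annotations):
    """Spearman rank correlations between ordinal annotation columns.

    annotations: dict mapping a variable name (e.g. "Stress", "SSS", "SMS", "GSS", "GMS")
                 to the list of its per-segment labels (gesture segments only).
    Returns a dict {(name_a, name_b): rho}.
    """
    result = {}
    for a, b in combinations(annotations, 2):
        rho, _ = spearmanr(annotations[a], annotations[b])
        result[(a, b)] = rho
    return result
```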

B. Characteristics of speech and gestures for different stress levels

In this section we take a closer look at the use of the communication components per stress level. Figure 6 presents three histograms and a number of gesture examples. Each histogram illustrates, for a given stress level, the distribution of the annotations for the four stress communication components. The colors are associated with the four components, and the height of a bar represents the number of occurrences given the specified stress level. A missing bar means that there were no occurrences for that variable given the considered stress level.

Note that the three histograms were not normalized, and the numbers of occurrences on the vertical axis apply to all three of them. We do not normalize them because in this way they give an impression of how the annotations were distributed with respect to the stress level: 33% of the data fall into the no stress case, 49% into medium stress and 18% into high stress. With the gesture pictures we highlight interesting cases of gestures from different bins of the histograms; the bins they belong to are indicated by arrows.
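The counts behind such histograms can be tabulated directly from the annotations; a minimal sketch, assuming the labels are gathered in a pandas DataFrame with hypothetical column names:

# Sketch: tabulate how often each label (1-3) of a communication component
# occurs within each multimodal stress level. Column names are hypothetical;
# they stand for the annotations described in the text.
import pandas as pd

# toy stand-in for the annotated segments (1 = no, 2 = medium, 3 = high)
segments = pd.DataFrame({
    "stress": [1, 1, 2, 2, 2, 3, 3],
    "speech_modulation_stress": [1, 1, 2, 2, 1, 3, 3],
    "gesture_modulation_stress": [None, 1, 2, None, 2, 3, 2],  # None = no gesture
})

for component in ["speech_modulation_stress", "gesture_modulation_stress"]:
    # rows: multimodal stress level, columns: component label, cells: counts
    counts = pd.crosstab(segments["stress"], segments[component])
    print(component, counts, sep="\n")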

The left histogram in Figure 6 contains the distribution for when there was no stress. It can be observed that for these segments, Speech Modulation Stress and Speech Semantics Stress predominantly have label 1, so they also do not convey stress. Moreover, a considerable proportion of 62% of these segments did not contain any gestures. When gestures appeared, they were assigned label 1 or 2 for Gesture Modulation Stress and Gesture Semantics Stress. The most interesting situations are those labeled 2, since this is not expected for the no stress case. Looking into examples from the data, we find that these gestures usually indicate little stress, like scratching the head or keeping open palms on the desk. They do show that the person is feeling a little tense, but given the overall conversation and the context, the segment was not labeled as stressful. Examples of such gestures are shown in the first column of gesture images in Figure 6.

[Figure 6 appears here: three histograms, one per multimodal stress level (1: No Stress (MM), 2: Medium Stress (MM), 3: High Stress (MM)), with the number of samples on the vertical axis and the stress level per communication component (SMS, SSS, GMS, GSS; labels 1: No, 2: Med., 3: High) on the horizontal axis, together with two rows of example gesture images (GSS ≠ MM stress examples and GMS ≠ MM stress examples).]

Fig. 6. The histograms represent the distribution of labels for Speech Modulation Stress (SMS), Speech Semantics Stress (SSS), Gesture Modulation Stress (GMS) and Gesture Semantics Stress (GSS) given a specified level of stress from the multimodal annotation (MM). We expect that the four communication components indicate the same stress level as perceived using multimodal data. The pictures represent examples of when this is not the case, for either Gesture Semantics Stress or Gesture Modulation Stress.

The middle histogram in Figure 6 shows the distributions for medium stress. This time the proportion of segments which did not contain gestures decreased to 51%. Speech Semantics Stress indicates no stress in 52% of the segments, which means that stress was perceived from other sources than the words. Speech Modulation Stress is in this case the component that most often receives label 2, together with the two gesture components. Nevertheless, it can be observed that many combinations of the four components can result in a medium stress label. This is a clear indication that automatically predicting medium stress is challenging, and that it is bound to generate confusion and make the recognition of the other stress classes difficult as well. The gesture pictures in the middle column of Figure 6 are examples of gestures that appeared during medium stress, but which had label 1 or 3 for Gesture Modulation Stress and Gesture Semantics Stress.

For the high stress level, depicted in the rightmost histogram of Figure 6, 47% of the segments contained no gestures. From this we conclude that the amount of gesticulation increases gradually as stress increases. Of the segments which did contain gestures, 70% indicate a high level of stress via modulation and 33% via their semantics. Speech Modulation Stress dominantly indicates high stress, in a proportion of 95%, and Speech Semantics Stress in 51% of the cases. Examples of gestures which indicated lower stress by their semantics and modulation, but appeared in segments labeled with high stress, are shown in the right column of Figure 6.

What we learn globally from Figure 6 is that the extreme cases are quite well indicated by the four variables: when there is no stress, or high stress, the four communication components mostly also indicate no and high stress, respectively. The most complex case is medium stress, for which many combinations are possible. The existence of all these possible combinations that lead to the same result gives insight into the difficulty of fusing them for automatic stress prediction. For research dealing with how to fuse inconsistent information from audio and video in the context of multimodal aggression detection we refer to [31].

C. Predicting stress from the four communication components

As an experiment to see how well the labels of the four communication components can be used for predicting stress, we considered them as features and applied a Bayesian network classifier. The ground truth for the classifier was the stress label. This results in a 71% unweighted average accuracy and 73% weighted average accuracy when we consider all the data, and somewhat higher accuracies when considering only the data for which gestures were visible. The confusion matrices for these two settings are shown in Table II. As expected, due to the high variability observed in expressing stress, we cannot achieve perfect accuracy even when we use the human labels of the four communication components. Note that beyond the difference in average accuracies, the gesture data setup yields a considerably higher recall for class 3, which corresponds to the most stressful situations. This finding highlights the importance of gestures and also signifies that performance in the all data setup suffers from missing data. These recognition results can be seen as an upper bound on the performance of the fully automatic stress assessment task.
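A minimal sketch of this experiment is given below, with a categorical naive Bayes classifier standing in for the Bayesian network used here, and with toy data in place of the annotations; it mainly illustrates how UA and WA are obtained from the predictions:

# Sketch: predict the stress label from the four human-annotated communication
# components and report unweighted (UA) and weighted (WA) average accuracy.
# CategoricalNB is a simple stand-in for the Bayesian network classifier.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# X: per-segment labels for SSS, SMS, GSS, GMS (values 1-3); y: stress label (1-3)
rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(300, 4))            # toy stand-in for the annotations
y = rng.integers(1, 4, size=300)

clf = CategoricalNB(min_categories=3)            # three possible labels per component
y_pred = cross_val_predict(clf, X - 1, y, cv=5)  # component labels shifted to 0-2
ua = balanced_accuracy_score(y, y_pred)          # mean per-class recall (UA)
wa = accuracy_score(y, y_pred)                   # overall accuracy = weighted average (WA)
print(confusion_matrix(y, y_pred), ua, wa)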

TABLE II
CONFUSION MATRICES IN % FOR PREDICTING STRESS FROM THE HUMAN ANNOTATED COMMUNICATION COMPONENTS, FOR ALL DATA (LEFT) AND ONLY GESTURES DATA (RIGHT).

All data (UA = 71, WA = 73)            Only gestures data (UA = 78, WA = 76)
       Classified as                          Classified as
        1    2    3                            1    2    3
   1   76   24    0                       1   77   23    0
   2   13   74   13                       2   11   73   16
   3    4   32   63                       3    0   16   84

To summarize, we observe from Table I that Speech Modulation Stress is a very good indicator of stress. The incidence of gestures increases gradually with the stress level, as seen in Figure 6, which means that even the frequency of gestures can be an indication of stress. For no stress and high stress, the values of the four communication components consistently indicate the same stress level most of the time. The medium stress level is characterized by a variety of combinations of all values of the four components, making it more difficult to come to a conclusion. Finally, when using the labels of the four communication components to predict stress, we achieve a UA of 71% for all data, and of 78% if we consider only the segments that contain gestures.

VIII. AUTOMATIC STRESS ASSESSMENT - RESULTS AND DISCUSSION

This section gives insight into the results for stress prediction using the model proposed in Figure 2, and compares them to a baseline of predicting stress directly from low level features. The section is organized in three parts: subsection VIII-A gives results for the baseline method, subsection VIII-B focuses on automatic prediction of the intermediate level variables of our model (recall Figure 2), and subsection VIII-C presents the final results for automatic stress assessment using our model.

A. Baseline: predicting stress from low level features

The baseline results are for predicting stress directly from low level features. We provide results for using the acoustic, text and video features separately, as well as for feature level and decision level fusion. Table III shows classification results for the BN classifier, given LOSO cross-validation and the segmentation described in section VI-C. It contains the weighted and unweighted average accuracies.
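The evaluation protocol can be sketched as follows, assuming illustrative feature arrays and a Gaussian naive Bayes stand-in for the Bayesian classifier; feature level fusion concatenates the modality features, while decision level fusion averages the per-modality posteriors:

# Sketch: LOSO cross-validation of a baseline classifier, with feature level
# fusion (concatenated features) and decision level fusion (averaged posteriors).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n = 200
X_audio = rng.normal(size=(n, 20))   # acoustic features per segment (illustrative)
X_video = rng.normal(size=(n, 30))   # STIP-based video features per segment
y = rng.integers(1, 4, size=n)       # stress labels 1-3
groups = rng.integers(0, 8, size=n)  # subject/session id used for LOSO

logo = LeaveOneGroupOut()
pred_flf, pred_dlf, truth = [], [], []
for tr, te in logo.split(X_audio, y, groups):
    # feature level fusion: concatenate modalities before classification
    Xf_tr = np.hstack([X_audio[tr], X_video[tr]])
    Xf_te = np.hstack([X_audio[te], X_video[te]])
    pred_flf.extend(GaussianNB().fit(Xf_tr, y[tr]).predict(Xf_te))

    # decision level fusion: average the per-modality posteriors
    pa = GaussianNB().fit(X_audio[tr], y[tr]).predict_proba(X_audio[te])
    pv = GaussianNB().fit(X_video[tr], y[tr]).predict_proba(X_video[te])
    pred_dlf.extend(np.argmax((pa + pv) / 2, axis=1) + 1)  # back to labels 1-3
    truth.extend(y[te])

print("FLF UA:", balanced_accuracy_score(truth, pred_flf))
print("DLF UA:", balanced_accuracy_score(truth, pred_dlf))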

TABLE III
BASELINE: PREDICTING STRESS FROM LOW LEVEL FEATURES (BAYESIAN CLASSIFIER). FLF = FEATURE LEVEL FUSION, DLF = DECISION LEVEL FUSION.

Features   Fusion   UA   WA
acoustic   -        64   62
text       -        47   51
STIP       -        29   34
all        FLF      61   62
all        DLF      62   61

From the results in Table III we notice that the audio features are better predictors than the text and video ones, and that neither feature level nor decision level fusion improves over the acoustic-only results. In Table IV we show the confusion matrix for decision level fusion, which provides a good balance between UA and WA.

TABLE IV
CONFUSION MATRIX IN % FOR AUTOMATICALLY PREDICTING STRESS USING DECISION LEVEL FUSION FOR THE BASELINE APPROACH.

       Classified as
        1    2    3
   1   79   20    0
   2   31   57   13
   3    6   47   47

UA = 62, WA = 61

We continue with the setup and results for predicting the intermediate level variables and the stress label using the model we proposed.

B. Automatic prediction of the intermediate level variables

For Speech Modulation Stress, the unit of analysis was the utterance segmentation. For the four gesture related variables, the unit of analysis was the gesture segmentation. This means that while for Speech Modulation Stress we analyzed the whole dataset, when learning the four gesture related variables we used only the part of the data which contained gestures.

For each prediction task we tested the three classifiers mentioned above, and experimented with and without applying the SMOTE technique to deal with class imbalance. Because predicting these variables is an intermediate step that affects the final stress prediction, we use the best performing approach in each case, as indicated in Table V.
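A minimal sketch of this step, using the SMOTE implementation from the imbalanced-learn package and a naive Bayes stand-in for the classifiers (array names are illustrative); oversampling is applied to the training data only:

# Sketch: oversample the minority classes of the training fold with SMOTE
# before fitting the classifier for an intermediate variable.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 15))                          # low level features
y_train = rng.choice([1, 2, 3], size=120, p=[0.6, 0.3, 0.1])  # imbalanced labels

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = GaussianNB().fit(X_res, y_res)                          # train on balanced data
# the test fold is left untouched: clf.predict_proba(X_test)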

TABLE V
WEIGHTED (WA) AND UNWEIGHTED (UA) ACCURACIES FOR PREDICTING THE INTERMEDIATE LEVEL VARIABLES FROM LOW LEVEL FEATURES.

Features   Predicted variable          SMOTE   UA   WA
acoustic   Speech Modulation Stress    no      66   64
STIP       Gesture Valence             yes     68   73
STIP       Gesture Arousal             yes     59   64
STIP       Gesture Topics              no      57   61
STIP       Gesture Modulation Stress   no      68   65

Table V shows that Speech Modulation Stress is predicted with good accuracy. Gesture Valence is also predicted with high accuracy, followed by Gesture Modulation Stress. Given that the same generic low level features are used for all these tasks, we consider the results satisfying. Furthermore, the results might have been affected by the high degree of approximation in the labels of Gesture Valence, Gesture Arousal and Gesture Topics, since they are generalized from labels of only one instance of each gesture class.

The Gesture Topics variable is the only one for which, instead of a three class problem, there was a six class problem. Since predicting these intermediate variables automatically from low level features poses three or six class problems, three or six soft decision values (posteriors), respectively, are passed to the feature set for the final stress assessment. The intermediate level variables are predicted with different accuracies, which can also affect how valuable the features are for stress prediction.
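The construction of this intermediate feature set can be sketched as a simple two-stage pipeline (names are illustrative; in practice the posteriors would be produced out-of-fold and missing gesture segments handled as missing data, both omitted here for brevity):

# Sketch: build the intermediate representation by concatenating the class
# posteriors of the intermediate-level classifiers, then predict stress from it.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 150
X_acoustic = rng.normal(size=(n, 20))    # low level acoustic features
X_stip = rng.normal(size=(n, 30))        # low level video (STIP) features
y_sms = rng.integers(1, 4, size=n)       # Speech Modulation Stress labels (3 classes)
y_gtopic = rng.integers(0, 6, size=n)    # Gesture Topics labels (6 classes)
y_stress = rng.integers(1, 4, size=n)    # final stress labels

# first stage: intermediate-level classifiers producing posteriors
p_sms = GaussianNB().fit(X_acoustic, y_sms).predict_proba(X_acoustic)    # n x 3
p_gtopic = GaussianNB().fit(X_stip, y_gtopic).predict_proba(X_stip)      # n x 6

# second stage: stress prediction from the stacked posteriors
X_intermediate = np.hstack([p_sms, p_gtopic])                            # n x 9
stress_clf = GaussianNB().fit(X_intermediate, y_stress)
print(stress_clf.predict(X_intermediate[:5]))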


C. Predicting stress using intermediate level variables

Table VI presents results for stress prediction using our model with intermediate level variables. We study the performance of each intermediate variable independently. Starting with Speech Modulation Stress, the best performing variable, we search for the best combination of two variables. Finally, we show the results obtained when all intermediate level variables are used.

TABLE VI
PREDICTING STRESS BASED ON THE INTERMEDIATE LEVEL VARIABLES (BAYESIAN CLASSIFIER), NO SMOTE. SMS = SPEECH MODULATION STRESS, SV = SPEECH VALENCE, SA = SPEECH AROUSAL, ST = SPEECH TOPICS, GMS = GESTURE MODULATION STRESS, GA = GESTURE AROUSAL, GV = GESTURE VALENCE AND GT = GESTURE TOPICS.

All data
Features        Fusion   UA   WA
SMS             -        66   62
SMS & text      DLF      63   62
SMS & GMS       DLF      64   64
SMS & GV        DLF      67   64
SMS & GA        DLF      65   62
SMS & GT        DLF      67   64
all             DLF      62   62
all, selected   DLF      69   66

When studying the performance of each single feature type in turn, we observe that Speech Modulation Stress performs best. A problem with the other word-related or gesture-related features is their sparsity. For the word-related features, the words from the ANEW list do not appear frequently in the spontaneous speech of the actors, and the keywords corresponding to the Speech Topics do not have a high frequency either. Most information conveyed linguistically concerns matters unrelated to the emotional state of the speaker, such as explaining directions or a situation. The same problem occurs for gestures, since they are available for only part of the data and are treated as missing data by the BN when not present.

The weighted and unweighted average accuracies of fusing Speech Modulation Stress with each of the other intermediate level variables are also shown in Table VI. It can be noticed that adding almost any other single variable does not change the result significantly. The best performance is achieved by fusing Speech Modulation Stress with Gesture Topics or with Gesture Valence. Another interesting phenomenon is that adding more features does not always improve performance: using all features is slightly worse than using only the Speech Modulation posteriors, probably because the classifier is fed less relevant features in addition to the three very informative posteriors. However, by running feature selection on this final feature vector, we obtain the best result. The selected feature set consists of the following 10 features: the three posteriors of Speech Modulation, text arousal, the aggressive and sighs speech topics, and the posteriors of the gesture classes extrovert low and extrovert high. With these features in addition to the SMS posteriors we observe a consistent increase in performance of, on average, 4% for all stress levels.

The confusion matrices for using only Speech Modulation and for using the selected features are shown in Table VII, left and right respectively, while the results of an experiment restricted to the segments that contain gestures, with feature selection, are presented in Table VIII.

TABLE VII
CONFUSION MATRICES IN % FOR PREDICTING STRESS USING SPEECH MODULATION STRESS FOR ALL DATA (LEFT), AND OUR APPROACH WITH FEATURE SELECTION FOR ALL DATA (RIGHT).

Speech Modulation only (UA = 66, WA = 62)    Feature selection (UA = 69, WA = 66)
       Classified as                                Classified as
        1    2    3                                  1    2    3
   1   77   21    2                             1   80   18    2
   2   33   49   17                             2   31   53   16
   3    5   25   70                             3    6   19   74

TABLE VIII
CONFUSION MATRIX IN % FOR PREDICTING STRESS GIVEN OUR APPROACH, WITH FEATURE SELECTION, ONLY ON GESTURE SEGMENTS.

       Classified as
        1    2    3
   1   74   24    2
   2   25   60   15
   3    4   23   73

UA = 69, WA = 67

By inspecting which samples from the data were well classified by Speech Modulation Stress only and which ones benefited from adding gestures, we noticed a number of interesting cases. For example, it can be the case that the employee is speaking in a calm manner, but the visitor is visible and his gestures indicate stress. It can also be the case that there is no speech (this happens rarely in our data and only for very short time intervals), and then gestures are the only indication we get. This situation also appears when there is physical aggression, e.g. throwing an object or pushing, which are sudden movements and sometimes not accompanied by any sound. All these cases would have been missed without using gestures.

It is interesting that Gesture Topics and Gesture Valence perform best in combination with Speech Modulation Stress. Gesture Topics has the property of indicating a degree of stress (the topics can be ordered with respect to stress), which can be seen as a quantitative function. Besides, it can also be seen as a qualitative indicator of stress, since the topics especially discriminate between stress types. Gesture Valence is an indicator of how positive or negative a gesture is, and therefore has a direct relation with the degree of stress.

All in all, the performance achieved using our approach significantly improves over the baseline. When comparing the per class accuracies of our approach with feature selection (Table VII, right) to those of standard decision level fusion (Table IV), we observe a dramatic increase of 27% in the recall of class 3 (high stress), at the cost of a small decrease for class 2 (medium stress). Given our envisioned application in the surveillance domain, this improvement in recognizing high stress is very beneficial, since we are particularly interested in not missing samples of medium and high stress.

When comparing the results achieved by using the human labels of the four communication components from Table II (left) to the results achieved by automatic prediction using our final approach (confusion matrix in Table VII, right), we notice that the automatic prediction yields better performance for high stress. Furthermore, the unweighted average accuracy for automatic stress prediction is only 2% absolute lower than the prediction based on the human labels, and the weighted average is 7% lower.

IX. SUMMARY AND CONCLUSION

To summarize, in the framework of automatic surveillance, we investigated how speech and gestures communicate stress and how they can be used for automatic assessment of stress. For this purpose we proposed a human model of stress communication, which distinguishes between the semantics expressed by speech and gestures, and the way in which the messages are delivered (modulation). We assessed how these components convey stress based on human annotated labels. As a next step, we proposed a new method for automatic stress prediction based on a decomposition of stress into a set of intermediate level variables. The intermediate level variables were obtained by operationalizing the communication components of the human model. We validated our model for automatic stress prediction and obtained significant improvements over a baseline predictor based on decision level fusion of the audio, text and video features.

To conclude, we answer the four research questions stated in the introduction. The first research question concerns the contribution of verbal and nonverbal communicative acts in conveying stress. In our human model, Speech Semantics Stress and Gesture Semantics Stress are considered verbal communication, while Speech Modulation Stress and Gesture Modulation Stress are considered nonverbal communication. Our findings point out that nonverbal communication, and in particular Speech Modulation Stress, is the most dominant in communicating stress. However, we learned that stress is conveyed by a large variety of combinations of these communicative acts, and if we do not consider them all we might miss the correct scene interpretation.

The second question refers to the contribution of gestures to stress communication and stress assessment. From the human model study we found that especially Gesture Modulation Stress is highly correlated with the stress level. Furthermore, we observed an increase in gesture frequency as the stress becomes higher. When automatically assessing stress based on a single intermediate feature type, Speech Modulation Stress had the best performance; when evaluating combinations of two intermediate level features, it combined best with Gesture Topics. In general, adding gesture information did not lead to large improvements. However, by examining the samples for which gestures had a positive impact, we found that these were difficult cases, mostly from the medium and high stress categories. Examples are stressful gestures of the visitor accompanied by calm speech of the employee, or stressful gestures without any speech.

The third question relates to the choice and performance of the intermediate level features. The best performing feature was Speech Modulation Stress, and it combined best with Gesture Topics. However, it must be noted that the performance of a number of other variables was negatively influenced by their sparsity.

Finally, to answer the fourth question, we state that our method for stress prediction based on intermediate level variables significantly improves over the baseline of predicting stress from low level audio-visual features. Furthermore, the increase in performance is dramatic for the high stress class, which is highly beneficial for the envisioned application.

REFERENCES

[1] M. Andersson, S. Ntalampiras, T. Ganchev, J. Rydell, J. Ahlberg, and N. Fakotakis. Fusion of acoustic and optical sensor data for automatic fight detection in urban environments. In Information Fusion (FUSION), 13th Conference on, pages 1–8, 2010.

[2] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir. Whodunnit - Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech. Computer Speech and Language, 25(1):4–28, 2011.

[3] L. Berkowitz and R. G. Geen. Affective aggression: The role of stress, pain, and negative affect. In E. Donnerstein (Ed.), Human aggression: Theories, research, and implications for social policy, pages 49–72, 1998.

[4] P. Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 2001.

[5] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. CoRR, abs/1106.1813, 2011.

[6] M. M. Bradley and P. J. Lang. Affective norms for English words (ANEW). The NIMH Center for the Study of Emotion and Attention, University of Florida, 1999.

[7] G. Burghouts and K. Schutte. Correlations between 48 human actions improve their detection. In International Conference on Pattern Recognition, 2012.

[8] G. Caridakis, G. Castellano, L. Kessous, A. Raouzaiou, L. Malatesta, S. Asteriadis, and K. Karpouzis. Multimodal emotion recognition from expressive faces, body gestures and speech. In C. Boukis, A. Pnevmatikakis, and L. Polymenakos, editors, Artificial Intelligence and Innovations 2007: from Theory to Applications, volume 247 of IFIP The International Federation for Information Processing, pages 375–388. Springer US, 2007.

[9] G. Castellano, S. D. Villalba, and A. Camurri. Recognising human emotions from body movement and gesture dynamics. In Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction, ACII '07, pages 71–82. Springer-Verlag, 2007.

[10] P. Ekman. Emotional and conversational nonverbal signals. In M. Larrazabal and L. Miranda (Eds.), Language, knowledge, and representation, 2004.

[11] P. Ekman and W. V. Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1:49–98, 1969.

[12] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis. Audio-visual fusion for detecting violent scenes in videos. In Artificial Intelligence: Theories, Models and Applications, volume 6040, pages 91–100, 2010.

[13] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan. Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10-11):787–800, 2007.

[14] H. Gunes and M. Piccardi. Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications, 30(4):1334–1345, 2007.

[15] H. Gunes, B. Schuller, M. Pantic, and R. Cowie. Emotion representation, analysis and synthesis in continuous space: A survey. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 827–834, 2011.

[16] J. Hansen, S. Bou-Ghazale, R. Sarikaya, and B. Pellom. Getting started with SUSAS: a speech under simulated and actual stress database. In EUROSPEECH, volume 97, pages 1743–46, 1997.

[17] J. Hansen and S. Patil. Speech under stress: Analysis, modeling and recognition. In C. Muller, editor, Speaker Classification I, volume 4343 of Lecture Notes in Computer Science, pages 108–137. 2007.

[18] P. Juslin and K. Scherer. Vocal expression of affect. In J. Harrigan, R. Rosenthal, and K. Scherer (Eds.), The New Handbook of Methods in Nonverbal Behavior Research, pages 65–135. Oxford University Press, 2005.

[19] I. Kanluan, M. Grimm, and K. Kroschel. Audio-visual emotion recognition using an emotion space concept. In 16th European Signal Processing Conference, Lausanne, Switzerland, 2008.

[20] M. Karg, A.-A. Samadani, R. Gorbet, K. Kuhnlenz, J. Hoey, and D. Kulic. Body movements for affective expression: a survey of automatic recognition and generation. Affective Computing, IEEE Transactions on, 4(4):341–359, 2013.

[21] A. Kendon. Gesture: Visible Action as Utterance. Cambridge: Cambridge University Press, 2004.

[22] M. Kipp. Anvil - a generic annotation tool for multimodal dialogue. In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech), 2001.

[23] M. Kipp and J.-C. Martin. Gesture and emotion: Can basic gestural form features discriminate emotions? In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, pages 1–8, 2009.

[24] R. M. Krauss, Y. Chen, and P. Chawla. Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us? Advances in Experimental Social Psychology, 28:389–450, 1996.

[25] R. M. Krauss, R. Dushay, Y. Chen, and F. Rauscher. The communicative value of conversational hand gestures. Journal of Experimental Social Psychology, 31:533–552, 1995.

[26] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008.

[27] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects, pages 237–244, 2007.

[28] R. S. Lazarus and S. Folkman. Stress, appraisal, and coping. Springer Publishing Company, 1984.

[29] I. Lefter, G. Burghouts, and L. Rothkrantz. An audio-visual dataset of human-human interactions in stressful situations. Journal on Multimodal User Interfaces, 8(1):29–41, 2014.

[30] I. Lefter, G. Burghouts, and L. J. M. Rothkrantz. Automatic audio-visual fusion for aggression detection using meta-information. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 19–24, Sept 2012.

[31] I. Lefter, L. Rothkrantz, and G. Burghouts. A comparative study on automatic audiovisual fusion for aggression detection using meta-information. Pattern Recognition Letters, 34(15):1953–1963, 2013.

[32] I. Lefter, L. J. Rothkrantz, and G. J. Burghouts. Aggression detection in speech using sensor and semantic information. In Text, Speech and Dialogue, pages 665–672. Springer, 2012.

[33] I. Lefter, L. J. Rothkrantz, D. A. Van Leeuwen, and P. Wiggers. Automatic stress detection in emergency (telephone) calls. International Journal of Intelligent Defence Support Systems, 4(2):148–168, 2011.

[34] D. McNeill. So you think gestures are nonverbal? Psychological Review, 92(3):350–371, 1985.

[35] M. Neff, Y. Wang, R. Abbott, and M. Walker. Evaluating the effect of gesture and language on personality perception in conversational agents. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, and A. Safonova, editors, Intelligent Virtual Agents, volume 6356 of Lecture Notes in Computer Science, pages 222–235. Springer Berlin Heidelberg, 2010.

[36] K. Scherer. Voice, stress, and emotion. In M. H. Appley and R. Trumbull (Eds.), Dynamics of Stress, pages 159–181. New York: Plenum, 1986.

[37] K. R. Scherer and T. Banziger. On the use of actor portrayals in research on emotional expression. In K. R. Scherer, T. Banziger, and E. B. Roesch, editors, Blueprint for affective computing: A sourcebook, pages 166–176. Oxford, England: Oxford University Press, 2010.

[38] K. R. Scherer, T. Johnstone, and G. Klasmeyer. Vocal expression of emotion. In R. J. Davidson, H. Goldsmith, and K. R. Scherer (Eds.), Handbook of the Affective Sciences. Oxford University Press, 2003.

[39] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, et al. The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. 2013.

[40] J. R. Searle. Speech acts: An essay in the philosophy of language. Cambridge University Press, 1969.

[41] J. Sprague, E. Verona, W. Kalkhoff, and A. Kilmer. Moderators and mediators of the stress-aggression relationship: Executive function and state anger. Emotion, 11(1):61–73, 2011.

[42] C. M. Whissell. The Dictionary of Affect in Language, volume 4, pages 113–131. Academic Press, 1989.

[43] Z. Yang. Multi-Modal Aggression Detection in Trains. PhD thesis, Delft University of Technology, 2009.

[44] Z. Zeng, M. Pantic, G. Roisman, and T. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, IEEE Trans. on, 31(1):39–58, 2009.

Iulia Lefter is a postdoctoral researcher at Delft University of Technology (TUDelft). She received her BSc (Computer Science) from Transilvania University of Brasov, and her MSc (Media and Knowledge Engineering) from TUDelft. In 2014 she obtained her PhD, working on a project involving TUDelft, TNO, and The Netherlands Defence Academy. Her PhD work focuses on behavior interpretation for automatic surveillance using multimodal data. Her interests include multimodal communication, affective computing, behavior recognition and multimodal fusion.

Gertjan J. Burghouts is a lead scientist in visual pattern recognition at TNO (Intelligent Imaging group), the Netherlands. He studied artificial intelligence at the University of Twente (MSc degree 2002) with a specialization in pattern analysis and human-machine interaction. In 2007 he received his PhD from the University of Amsterdam on the topic of visual recognition of objects and their motion in realistic scenes with varying conditions. His research interests cover recognition of events and behaviours in multimedia data. He was principal investigator of the Cortex project within the DARPA Mind's Eye program.

Leon Rothkrantz studied Mathematics at the University of Utrecht and Psychology at the University of Leiden. He completed his PhD study in Mathematics at the University of Amsterdam. Since 1980 he has been appointed as (Associate) Professor of Multimodal Communication at Delft University of Technology, and since 2008 as Professor of Sensor Technology at The Netherlands Defence Academy. He was a visiting lecturer at the University of Prague, and received medals of honour from the Technical University of Prague and the Military Academy at Brno. Prof. Rothkrantz is (co-)author of more than 200 scientific papers on Artificial Intelligence, Speech Recognition, Multimodal Communication and Education.