A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction

Dimosthenis Kontogiorgos¹, Vanya Avramova¹, Simon Alexanderson¹, Patrik Jonell¹, Catharine Oertel¹ ², Jonas Beskow¹, Gabriel Skantze¹, Joakim Gustafson¹

¹ KTH Royal Institute of Technology, Sweden
² École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Abstract
In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data: we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both a speaker and a listener. We annotated the groups' object references during object manipulation tasks and analysed the group's proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions, we could corroborate the differences in time between speakers' and listeners' eye gaze found in earlier studies. This corpus is of particular interest to researchers studying social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

Keywords: multimodal situated interaction, social eye-gaze, referential gaze, joint attention, mutual gaze, reference resolution

1. Introduction
In this corpus we combine verbal and non-verbal communication cues to model shared attention in situated interaction. We capture multimodal cues using a multisensory setup in order to extract information on visual attention during deictic references in collaborative dialogue. We made use of a moderator who had the task of leading the interaction and making sure that both participants were involved in the task and in the conversation. During the interaction, the moderator used referring expressions and gaze to direct the participants' attention to objects of interest on a large touch screen (Figure 1).

Figure 1: Participants used referring expressions and gaze to direct each other's attention while moving objects on a large display.

Our purpose in collecting the corpus presented in this paper is to study social eye gaze in multiparty interaction. The goal is to create visual attention models that make it possible for our robots to direct humans' attention to certain objects. There are several novelties with our corpus. Firstly, it is a three-party interaction with and without an interactive touch screen, where we have synchronised data streams of gaze targets, head direction, hand movement and speech (run through ASR with word timings). Second, one of the three participants is a mediator that has the role of encouraging the participants to reconsider their decisions and to foster collaboration. In all recordings we use the same person as the mediator, which makes it possible for us to build coherent verbal and non-verbal behavioural models. We will use these to develop a robot that can be used as a moderator in similar collaborative tasks. To our knowledge, there is no other publicly available corpus with these features.

During their interaction, the participants collaborated to furnish a virtual apartment using a collection of available furniture objects. It was their task to discuss and decide on which objects they would use, given that they had a limited budget. The moderator had the role of leading the discussion. However, this does not change the fact that this is an example of dynamic multiparty situated interaction, which (Bohus and Horvitz, 2009) define as open-world dialogue. When discussing the furniture objects, the participants naturally made use of a combination of verbal referring expressions and non-verbal cues such as deictic gestures and referential gaze. We are particularly interested in the participants' gaze behaviour just before and during verbal referring expressions. When analysing our corpus, we find that listeners and speakers display different visual behaviour in some cases.

In the next sections, we provide an overview of the state-of-the-art in multimodal multiparty corpora and relevant work on various types of social eye gaze. We further describe our data collection process and experiment design, and our methods for automatic eye gaze extraction. Finally, we go through the collected data and give an overview of the participants' gaze behaviour during the task-oriented dialogues, and our findings supported by the relevant literature.


2. Background

2.1. Multiparty multimodal corpora
Over the last decade, more and more multimodal multiparty corpora have been created, such as the ones described in (Carletta, 2007; Mostefa et al., 2007; Oertel et al., 2014; Hung and Chittaranjan, 2010; Oertel et al., 2013; Stefanov and Beskow, 2016). (Carletta, 2007) and (Mostefa et al., 2007) fall into the category of meeting corpora, (Hung and Chittaranjan, 2010) and (Stefanov and Beskow, 2016) are examples of games corpora, and (Oertel et al., 2014) is a job interviewing corpus. The corpus from (Oertel et al., 2013), in contrast to the corpora listed above, tries to escape the lab environment and gathers data of multi-party interactions "in the wild".
In (Carletta, 2007), (Mostefa et al., 2007) and (Oertel et al., 2013), the visual focus of the interlocutors is divided, i.e. there are stretches in the corpus in which the interlocutors' main focus of attention is on each other, and there are stretches where they mainly focus on, e.g., the whiteboard or sheets of paper lying in front of them. However, with the technical set-up used for the recordings at the time, it was hard to infer the exact points the participants were looking at. In (Oertel et al., 2014) and (Hung and Chittaranjan, 2010), participants' visual focus of attention is solely on the other participants (no other objects are present during the recordings).
Another example is (Stefanov and Beskow, 2016), who in their corpus study the visual focus of attention of groups of participants. For this, they recorded groups of three participants while they were sorting cards. They also recorded groups of three participants discussing their travel experiences, without any objects present that might distract the visual focus of attention.
Finally, while (Lücking et al., 2010) is not a multiparty corpus, it should be mentioned here as it is, similarly to the corpus described in this paper, particularly well suited for the study of referring expressions. In terms of experimental setup, the corpus recording described in this paper is most similar to (Stefanov and Beskow, 2016).

2.2. Social eye-gaze
Social eye gaze refers to the communicative cues of eye contact between humans and is usually divided into four main types (Admoni and Scassellati, 2017): 1) Mutual gaze, where both interlocutors' attention is directed at each other; 2) Joint attention, where both interlocutors focus their attention on the same object or location; 3) Referential gaze, which is directed to an object or location and often comes together with referring language; and 4) Gaze aversions, which typically avert from the main direction of gaze, i.e. the interlocutor's face.
Joint attention is of particular importance for communication. It provides participants with the possibility to interpret and predict each other's actions and react accordingly. In the current corpus, for example, the modeling of joint attention is of particular importance as it provides participants with the possibility to track the interlocutors' current focus of attention in the discourse (Grosz and Sidner, 1986). A common quality of joint attention is that it may start with mutual gaze, to establish where the attention of the interlocutor is, and end towards the referential gaze direction to the most salient object (Admoni and Scassellati, 2017).
Also, as participants are very likely to be focused more on the display than on each other, joint attention will be crucial to discern whether they are paying attention to each other. Modelling of joint attention is also crucial when developing fine-grained models of dialogue processing (Schlangen and Skantze, 2009), which for example makes it possible for a dialogue system to give more timely feedback (Meena et al., 2014). With regard to multi-party interaction, there are also recent studies which model the situation in which the interaction takes place, in order to manage several users talking to the system at the same time (Bohus and Horvitz, 2010) and references to objects in the shared visual scene (Kennington et al., 2013).

2.3. Multimodal reference resolution
A reference is typically a symbolic representation of a linguistic expression to a specific object or abstraction. During early attempts at verbal communication between humans and machines, researchers used rule-based systems to disambiguate words and map them to referent objects in virtual worlds (Winograd, 1972). Research has also focused on disambiguating language using multimodal cues, starting in the late 70s with Richard Bolt's "Put-That-There" (Bolt, 1980) and continuing with recent approaches using eye-gaze (Mehlmann et al., 2014; Prasov and Chai, 2008; Prasov and Chai, 2010), head pose (Skantze et al., 2015) and pointing gestures (Lücking et al., 2015). Gross et al. recently explored the variability and interplay of different multimodal cues in reference resolution (Gross et al., 2017).
Eye gaze and head direction have been shown to be good indicators of object saliency in human interactions, and researchers have developed computational methods to construct saliency maps and identify humans' visual attention (Bruce and Tsotsos, 2009; Sziklai, 1956; Borji and Itti, 2013; Sheikhi and Odobez, 2012). Typically, speakers direct the listeners' attention to objects using verbal and non-verbal cues, and listeners often read the speaker's visual attention during referring expressions to get indications of the referent objects.

3. Data collection

3.1. Scenario
We optimised for variation in conversational dynamics by dividing the corpus recordings into two conditions. In the first condition, the moderator facilitated a discussion about the participants' experience on the topic of sharing an apartment. In this condition, no distracting objects were present and the screen of the large display was off. In the second condition, the participants were asked to collaborate on decorating an apartment (Figure 2) that they should imagine they would be moving into together. They were given an empty flat in which they had to decide where to place furniture, which they could buy from the store. The stores were provided as extra screens in the application, which they could go through to get new pieces of furniture. They were given a limited budget, which fostered their decisions and discussions on what objects to choose. Given the variety of available objects, they would need to compromise on their choices and collectively decide where to place objects.

Figure 2: The application running on the large display. The objects were "bought" from the furniture shop, which had a variety of available furniture.

3.2. Participants
We collected data from 30 participants, with a total of 15 interactions. Our moderator (a native US-English speaker) was present in all 15 sessions, facilitating the structure of the interactions and instructing participants on their role for completing the task. The moderator was always at the same part of the table, and the two participants were sitting across from the moderator (Figure 1). The mean age of our participants was 25.7; 11 were female and 19 were male, and the majority of them were students or researchers at KTH Royal Institute of Technology in Stockholm, Sweden.

3.3. Corpus
Our corpus consists of a total of 15 hours of recordings of triadic interactions. All recordings (roughly one hour each) contain data from various sensors capturing motion, eye gaze, audio and video streams. Each session follows the same structure: the moderator welcomes the participants with a brief discussion on the topic of moving into a flat with someone, and thereafter introduces the setup and scenario of the two participants planning their move into the same apartment using the large display. During the interactions we collected a set of multimodal data, a variety of input modalities that we combined to get information on participants' decision making and intentions.
Out of the 15 sessions, 2 had no successful eye gaze calibration and were discarded. One session had synchronisation issues in the screen application and was also discarded. Lastly, one session had large gaze data gaps and was discarded as well. The remaining 11 sessions of aggregated and processed data from the corpus are further described in the rest of the paper and are available at https://www.kth.se/profile/diko/page/material.

3.4. Experimental setup
Participants were situated around a table with a large touch display. On the display there was an application we developed to facilitate their planning of moving into a flat together. We gave participants a pair of gloves with reflective markers and eye tracking glasses (Tobii Glasses 2¹), which also had reflective markers on them to track their position in space. The room was surrounded with 17 motion capture cameras, positioned in such a way that both gloves and glasses were always in the cameras' sight.
The moderator was also wearing eye tracking glasses. Since our aim is to develop models for a robot without hands, the moderator was not wearing gloves with markers and was instructed to avoid using hand gestures. There were two cameras placed on the table capturing facial expressions, and a regular video camera at a distance recording the full interaction for annotation purposes. On the glasses we also placed lavalier microphones (one per participant), in such a way that we captured each subject's voice separated from the rest of the subjects' speech and with consistent volume.
The participants collaborated in the given scenario for 1 hour, during which they discussed and negotiated to form a common solution in the apartment planning. At the end of the recording we asked participants to fill in a questionnaire on their perception of how the discussion went, their negotiations with the other participant, and finally some demographic information. All participants were reimbursed with a cinema ticket.

3.5. Motion capture
We used an OptiTrack motion capture system² to collect motion data from the subjects. The 17 motion capture cameras collected motion from reflective markers at 120 frames per second. To identify rigid objects in 3D space, we placed 4 markers per object of interest (glasses, gloves, display) and captured position (x, y, z) and rotation (x, y, z, w) for each rigid object. While 3 markers are sufficient for capturing the position of a single rigid body, we placed a 4th marker on each object for robustness. That way, if one of the markers was not captured, we would still identify the rigid object in space.
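As an illustration of how such marker data can be turned into a rigid-body pose, the following is a minimal sketch (not the OptiTrack software's own algorithm) that recovers a rotation and translation from a calibrated reference marker layout using the Kabsch method; the function and variable names are our own.

    import numpy as np

    def rigid_body_pose(reference, observed):
        """Estimate rotation R and translation t mapping a rigid body's
        reference marker layout onto the observed marker positions
        (Kabsch/Procrustes). Both arrays are (N, 3) with N >= 3."""
        ref_c = reference - reference.mean(axis=0)
        obs_c = observed - observed.mean(axis=0)
        H = ref_c.T @ obs_c
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
        t = observed.mean(axis=0) - R @ reference.mean(axis=0)
        return R, t

    # With four markers per object, any three visible markers are enough;
    # the fourth adds redundancy if one marker drops out in a frame.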

3.6. Eye gaze
We were interested in collecting eye gaze data for each participant in order to model referential gaze, joint attention, mutual gaze and gaze aversion in multiparty situated dialogue (Figure 3). In order to capture eye gaze in 3D space, we used eye tracking glasses with motion capture markers. This made it possible to accurately identify when a participant's gaze trajectory intersected objects or the other interlocutors. It also made it possible to capture gaze aversion, i.e. when participants gazed away while speaking or listening to one of the participants.
Gaze samples were at 50Hz, and the data was captured by tracking the subjects' pupil movements and pupil size, along with a video from their point of reference. We placed reflective markers on each pair of glasses and were therefore able to identify the gaze trajectory in x and y in each participant's point of reference, and then resolve it to world coordinates using the glasses' relative position in 3D space. The glasses, using triangulation from both eyes' positions, also provided a z value, which refers to the point where the trajectories from the two eyes meet. That point in space (x, y, z) was used to resolve the eye gaze trajectory in 3D space.

¹ http://www.tobiipro.com
² http://optitrack.com
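To make the geometry concrete, here is a minimal sketch, under our own naming assumptions, of turning a gaze point expressed in the glasses' local frame into a world-space ray using the glasses' rigid-body pose, together with the 0.2 m head-sphere test described later for gaze at persons (Section 5.3.); it illustrates the idea rather than reproducing the exact pipeline code.

    import numpy as np

    def gaze_ray_world(R_glasses, t_glasses, gaze_point_local):
        """Turn a gaze point in the glasses' local frame (x, y from the
        scene camera, z from binocular vergence) into a world-space ray:
        origin at the glasses, direction towards the 3D gaze point."""
        origin = np.asarray(t_glasses, dtype=float)
        target = np.asarray(R_glasses) @ np.asarray(gaze_point_local, dtype=float) + origin
        direction = target - origin
        return origin, direction / np.linalg.norm(direction)

    def looks_at_head(origin, direction, head_centre, radius=0.2):
        """True if the gaze ray passes within `radius` metres of a head,
        approximating the head as a sphere (as used for gaze at persons)."""
        to_head = np.asarray(head_centre, dtype=float) - origin
        along = max(float(np.dot(to_head, direction)), 0.0)   # closest approach along the ray
        closest = origin + along * direction
        return float(np.linalg.norm(to_head + origin - closest)) <= radius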

Figure 3: By combining synchronised data from motion capture and eye tracking glasses, we captured the participants' eye gaze in 3D space, as well as head pose and gestures.

3.7. Speech and language
We recorded audio from each participant using channel-separated lavalier microphones attached to the eye tracking glasses. Each microphone captured speech that we later transcribed automatically using speech recognition, resolving the spoken utterances to text. We used a voice activity detection filter on all channels to separate captured speech from the other speakers. We further used Tensorflow's³ neural network parser to form syntactic representations of the automatically transcribed natural language text.

3.8. Facial expressions
In order to extract facial expressions, we placed GoPro cameras on the table in front of the participants and the moderator. We also recorded each session from a camera placed on a tripod further back in the room, to capture the interaction as a whole and for later use in annotating the corpus.

3.9. Application
We built an application to enable the interaction with several virtual objects (Figure 2). The application consisted of two main views: a floor plan and a furniture shop. The floor plan, displayed on the large multi-touch display, was used by the participants during their interactions as the main area in which to manipulate the objects, while the shops were used to collect new objects.
The shop screens were divided by room categories, such as kitchen or living room. They naturally induced deictic expressions, as the participants referred to the objects during their negotiations using both language and referential gaze (i.e. "it", "the desk", "the bed"). The floor plan, on the contrary, was the main display where they could manipulate objects by placing them in rooms, and they used referring language for these virtual locations (i.e. "here", "there", "my room").

³ https://www.tensorflow.org/versions/r0.12/tutorials/syntaxnet

The floor plan view also displayed the apartment scale and the budget the participants could spend on objects. The budget was limited; after initial pilots, it was decided to have a value that would be enough to satisfy both participants' furniture selections, but limited enough to foster negotiations on what objects they would select. They were also only allowed to choose one of each item, in order to foster negotiations further.
The application maintained event-based logging at 5 fps to keep track of the interaction flow, object movement and allocations. Each event carried time stamps and relevant application state, as well as position and rotation data for all objects. By placing markers in the corners of the screen, the display could be placed into the motion capture coordinate system. This allowed us to get the virtual object events in 3D space and capture the participants' visual attention to these objects.
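As an illustration of this last step, the following minimal sketch maps an on-screen position into the motion-capture coordinate system via the corner markers; it assumes the application logs positions in normalised screen coordinates, and the function name is hypothetical.

    import numpy as np

    def screen_to_world(u, v, corners):
        """Map a normalised screen position (u, v in [0, 1]) to 3D world
        coordinates by bilinear interpolation between the four corner
        markers placed on the display.
        `corners` = (top_left, top_right, bottom_left, bottom_right),
        each a length-3 array in motion-capture coordinates."""
        tl, tr, bl, br = (np.asarray(c, dtype=float) for c in corners)
        top = (1 - u) * tl + u * tr
        bottom = (1 - u) * bl + u * br
        return (1 - v) * top + v * bottom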

3.10. Subjective measures
At the end of each session, we gave participants a questionnaire to measure their impression of how the discussion went and how well they thought they had collaborated when decorating the apartment. We used these measures to assess dominance as well as collaborative behaviour and personality. All participants were also asked to fill in personality tests before coming to the lab; the tests included introversion and extroversion measures (Goldberg, 1992).

3.11. Calibration
The sensors we used required calibration in order to successfully capture motion and eye movements. We calibrated the positioning of all 17 cameras at the beginning of all recordings, while the eye tracking glasses required calibration for each subject separately. This is because each participant's eye positions vary, as does the way their eyes move during saccades.

4. Annotations
We annotated referring expressions to objects on the display by looking at what object the speaker intended to refer to. Speakers typically drew their interlocutors' attention to objects using deictic expressions, referring language and referential gaze, as well as mutual gaze. In every dialogue there was a set of references to objects not included in our simulated environment ("I also have a fireplace in my flat but I do not use it a lot"). However, we only annotated references that can be resolved to objects existing in our simulated environment, i.e. the application on the touch display. The references we can resolve in this environment are therefore only a subset of all possible references in the dialogue.
For the annotations we used the videos of the interactions, together with the ASR transcriptions and the current state of the app (visible objects or current screen on the large display). We defined each referring expression by looking at the time of the speaker's utterance. The timing was defined from the ASR transcriptions, which were synchronised with the gaze and gesture data. Utterances were split into inter-pausal units (IPUs), separated by silent pauses longer than 250ms (Georgeton and Meunier, 2015).
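A minimal sketch of such an IPU segmentation, assuming the ASR output is available as (token, start, end) tuples in seconds; the function name and data layout are our own, not the actual annotation tooling.

    def split_into_ipus(words, max_pause=0.250):
        """Group ASR word hypotheses into inter-pausal units (IPUs),
        starting a new unit whenever the silent gap between consecutive
        words exceeds `max_pause` seconds.
        `words` is a list of (token, start_time, end_time) tuples."""
        ipus, current = [], []
        for word in words:
            if current and word[1] - current[-1][2] > max_pause:
                ipus.append(current)
                current = []
            current.append(word)
        if current:
            ipus.append(current)
        return ipus

    # Example: the 0.4 s gap between "sofa" and "here" starts a new IPU.
    # split_into_ipus([("the", 0.0, 0.2), ("sofa", 0.2, 0.6), ("here", 1.0, 1.3)])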


The ASR transcriptions were used as input to the Tensorflow syntactic parser, from which we extracted the part-of-speech (POS) tags. Similarly to (Gross et al., 2017), as linguistic indicators we used full or elliptic noun phrases such as "the sofa", "sofa", "bedroom", pronouns such as "it", "this" and "that", possessive pronouns such as "my" and "mine", and spatial indexicals such as "here" or "there". The ID of the salient object or location of reference was identified and saved in the annotations. In some cases there was only one object referred to by the speaker, but there were also cases where the speaker referred to more than one object in one utterance (i.e. "the table and the chairs"). The roles of the speaker and listeners varied in each expression. In total we annotated 766 referring expressions out of 5 sessions, roughly 5 hours of recordings, which was about one third of our corpus.
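For illustration, a simplified sketch of how such linguistic indicators could be picked out of a POS-tagged IPU; the word lists, tag conventions and names are illustrative assumptions rather than the exact annotation tooling.

    REF_PRONOUNS = {"it", "this", "that", "these", "those"}
    POSSESSIVES = {"my", "mine", "your", "yours", "our", "ours"}
    SPATIAL_INDEXICALS = {"here", "there"}

    def linguistic_indicators(tagged_tokens):
        """Pick out candidate referring words from a POS-tagged IPU.
        `tagged_tokens` is a list of (token, pos) pairs; noun phrases are
        approximated here by Penn Treebank NN* tags."""
        hits = []
        for token, pos in tagged_tokens:
            lowered = token.lower()
            if (pos.startswith("NN") or lowered in REF_PRONOUNS
                    or lowered in POSSESSIVES or lowered in SPATIAL_INDEXICALS):
                hits.append((token, pos))
        return hits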

5. Data processing
The various modalities were post-processed and analysed; in particular, combined eye gaze and motion capture provided an accurate estimation of gaze pointing in 3D space. We used a single world coordinate system, in meters, for both the eye gaze and motion capture input streams, and we identified the gaze trajectories, head pose and hand gestures. We started with a low-level multimodal fusion of the input sensor streams and further aggregated the data, together with speech and language, into high-level sequences of each participant's speech and visual attention. Each data point had automatic speech transcriptions, syntactic parsing of these transcriptions, gesture data in the form of moving objects, and eye gaze in the form of a target object or person.

5.1. Data synchronisation
We used sound pulses from each device to sync all signals in a session. The motion capture system sent a pulse on each frame captured, while the eye tracking glasses sent 3 pulses every 10 seconds. The application also sent a pulse every 10 seconds. All sound signals were collected on the same 12-channel sound card, which would then mark the reference point in time for each session.
Audio signals were also captured on the sound card; therefore we were able to identify a reference point that would set the start time for each recording. That was the last of the sensors to start, which was one of the eye tracking glasses. Apart from the sound pulses, we used a clapperboard with reflective markers in its closed position, which would mark the start and the end of each session. We used that to sync the video signals, and as a safety point in case one of the sound pulses failed to start. Since the application on the display sent sync signals, we used it to mark the separation in time of the two experimental conditions. The first one (free discussion) ended when the app started, which would lead to the second condition (task-oriented dialogue).
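A minimal sketch of the alignment idea, assuming each device's sync pulses were recorded as one sound-card channel: pulse onsets give the device's frame times on the common sound-card clock, which can then be expressed relative to the chosen session start; thresholds and names are our own.

    import numpy as np

    def pulse_onsets(channel, sample_rate, threshold=0.5):
        """Return onset times (s) of sync pulses in one sound-card channel:
        samples where the absolute signal first rises above `threshold`."""
        above = np.abs(np.asarray(channel)) > threshold
        rising = np.flatnonzero(above[1:] & ~above[:-1]) + 1
        return rising / sample_rate

    def to_common_timeline(device_pulse_times, session_start):
        """Device frame i occurred at device_pulse_times[i] on the sound
        card; re-express those times relative to the session start (here,
        the first pulse of the last sensor to come online)."""
        return np.asarray(device_pulse_times) - session_start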

5.2. Visualisation and visual angle threshold
As illustrated in Figure 3, we calculated the eye gaze trajectories by combining motion capture data and eye tracking data per frame. We implemented a visualisation tool⁴ in WebGL to verify and evaluate the calculated eye gaze in 3D space. Using this tool we were able to qualitatively investigate sections of multimodal turn-taking behaviour of the participants, by visualising their position and gaze at objects or persons, along with their transcribed speech and audio in wav format. We also used this tool to empirically define the visual angle threshold for each eye tracker regarding the accuracy of gaze targets. During pilots, we asked participants to look at different objects, ensuring there were angle variations, to identify the threshold of the gaze angle to the dislocation vector (between the glasses and the salient object).

⁴ Available at https://www.kth.se/profile/diko/page/material

5.3. Automatic annotation of gaze data
We extracted eye-gaze data for all participants and all sessions and noticed that there were gaps between gaze data points quite often, in some sessions more than in others. We applied smoothing to eliminate outliers and interpolated gaps of up to 12 frames (100ms on our 120 fps data). Typically an eye fixation is 250ms, but no smaller than 100ms (Rayner, 1995), which defined our smoothing strategy. The data was then filtered in the same time window to only provide data points with fixations, and not include saccades or eye blinks.
There were, however, gaps longer than 12 frames, caused by lack of data from the eye trackers. In such cases the eye trackers had no information on the eyes' positions, which means there was no knowledge of whether a speaker/listener looked at the referent object. For such cases of missing gaze data, we went through the manually annotated referring expressions and checked the relevant frames of the gaze data. If at least one of the participants had no gaze data, we discarded the relevant referring expression from our analysis. After cleaning those cases, we had 582 expressions remaining out of the 766 initially annotated⁵.
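A minimal sketch of this gap interpolation and fixation filtering, assuming the gaze samples for one participant are held in a pandas DataFrame indexed by frame; the thresholds follow the text above, the rest is illustrative.

    import pandas as pd

    def clean_gaze(gaze, fps=120, max_gap_frames=12, min_fixation_s=0.100):
        """Interpolate short drop-outs (<= 12 frames, i.e. 100 ms at 120 fps)
        in the gaze coordinates and keep only stretches long enough to count
        as fixations; longer gaps stay NaN, and referring expressions that
        overlap them are discarded downstream.
        `gaze` is a DataFrame with columns x, y, z indexed by frame."""
        filled = gaze.interpolate(limit=max_gap_frames, limit_area="inside")
        valid = filled.notna().all(axis=1)
        # label contiguous runs of valid frames, drop runs shorter than a fixation
        run_id = (valid != valid.shift()).cumsum()
        run_len = valid.groupby(run_id).transform("size")
        keep = valid & (run_len >= int(min_fixation_s * fps))
        filled = filled.copy()
        filled[~keep] = float("nan")
        return filled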

The average gaze data loss was 40.8%; the session with the highest loss rate was at 71.5%, while the lowest was at 26.9%.
During a referring expression, participants' visual attention spanned a) a variety of visible objects on the screen, b) their interlocutors, or c) none of the above, which we assume in this corpus to be gaze aversion. The prominent objects of visual attention were identified by calculating each participant's visual angle α between their gaze vector g and the dislocation vector o, for every visible object on the screen, for every frame:

\alpha_{ij} = \arccos\left(\frac{\vec{g}_{ij} \cdot \vec{o}_{ij}}{|\vec{g}_{ij}|\,|\vec{o}_{ij}|}\right)    (1)

Similarly, we approximated each person's head with a sphere of radius 0.2m, and automatically annotated all frames where the gaze vector intersected the sphere as visual attention towards that person.
Finally, after filtering for eye fixations, we calculated the proportional gaze likelihood per object:

P(o_i \mid t_i) = \frac{c(o_i, t_i)}{c(t_i)}    (2)

⁵ The 5 sessions chosen for annotation were the ones with the smallest percentage of gaze data loss.


For each object o_i, we gathered the proportional gaze data points and counted the amount of time the object was gazed at during the time t of an utterance. As a gaze target we defined the area around any object that is within the visual angle threshold defined above. In many cases, more than one object competed for saliency in the same gaze target area. We therefore defined joint attention not as the group of objects gazed at by the interlocutors, but as gaze at the same area around the prominent object, given the visual angle threshold.
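For illustration, equations (1) and (2) written as code, under our own naming assumptions: the visual angle is computed per frame and object, and the proportional gaze is the fraction of utterance frames spent on each gaze target.

    import numpy as np

    def visual_angle(gaze_dir, eye_pos, obj_pos):
        """Equation (1): angle (degrees) between the gaze direction and the
        vector from the glasses to an object."""
        o = np.asarray(obj_pos, dtype=float) - np.asarray(eye_pos, dtype=float)
        g = np.asarray(gaze_dir, dtype=float)
        cos = np.dot(g, o) / (np.linalg.norm(g) * np.linalg.norm(o))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    def proportional_gaze(gazed_targets, utterance_frames):
        """Equation (2): for each object, the fraction of frames in an
        utterance during which it was the gazed-at target.
        `gazed_targets` maps frame index -> object id (or None)."""
        counts = {}
        for f in utterance_frames:
            target = gazed_targets.get(f)
            if target is not None:
                counts[target] = counts.get(target, 0) + 1
        total = len(utterance_frames)
        return {obj: c / total for obj, c in counts.items()}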

6. Results

In this section we describe the gaze patterns observed between and across the two conditions, and the proportional gaze during and before referring expressions. In Figure 4 we show the proportional eye-gaze of the moderator, as well as the participants' eye gaze, on the display. As can be observed, the participants spent more time gazing at the display than at the moderator or at each other, which is to be expected in a task-oriented dialogue.

Figure 4: Proportional amount of participant eye-gaze per session on the display during the second condition (task-oriented dialogue).

For each utterance, a set of prominent objects was defined from the annotations. Given the gaze targets per participant, we calculated the proportional gaze from the speaker and the listeners during the time of the utterance, and exactly 1 second before the utterance. Since all interactions were triadic, there were two listeners and one speaker at all times. To compare across all utterances, we looked at the mean proportional gaze of the two listeners to the area of the prominent object, to define them as the listener group. We then compared the gaze of the listener group to the speaker's gaze close to the referent objects. We also looked at the proportional combined gaze to other objects during the utterance (all other objects that were gazed at during the utterance), gaze at the speaker, and averted gaze. In Figure 5, the blue colour refers to the proximity area of the referent object from the speaker's utterance, while orange refers to the gaze at all other known objects in the virtual environment combined. Grey is for the gaze to the listeners or the speaker during the utterance, and yellow for averted gaze.

6.1. Eye gaze during referring expressions
We looked at all references to objects (N = 582) and compared the means of the speaker's proportional gaze against the listeners' proportional gaze to the proximity area of the referent objects, to the rest of the gazed objects, and towards each other. We conducted paired sample t-tests and found a significant difference between the speaker's gaze to the proximity area of the referent object (M = 46.69, Std. Error = 1.37) and the listeners' gaze to the area around the referent object (M = 39.45, Std. Error = 1.11), with [t = 4.942, p = 0.001]. There was, however, no significant difference between the speaker's gaze to the listeners and the gaze from the listeners to the speaker [t = -0.816, p = 0.415] (mutual gaze).
There was also a significant difference between the proportional gaze of the speaker to the area around the referent object (M = 46.69, Std. Error = 1.37) and the proportional gaze to other objects during the same utterances (M = 32.41, Std. Error = 1.25) [t = 5.887, p = 0.001]. However, no significant difference was found in the same comparison for the listeners' gaze [t = -0.767, p = 0.444].
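The comparisons above are paired per referring expression; a minimal sketch of such a paired-sample t-test is shown below, with randomly generated placeholder values standing in for the per-expression gaze proportions (the real values come from the corpus annotations).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder data standing in for the 582 annotated references:
    # per-expression proportional gaze (%) at the referent-object area.
    speaker_gaze = rng.normal(46.7, 20.0, size=582).clip(0, 100)
    listener_gaze = rng.normal(39.5, 20.0, size=582).clip(0, 100)   # mean of the two listeners

    t, p = stats.ttest_rel(speaker_gaze, listener_gaze)   # paired-sample t-test
    print(f"t = {t:.3f}, p = {p:.3f}")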

6.2. Eye gaze before referring expressions
Previous studies have revealed that speakers typically look at referent objects about 800-1000ms before mentioning them (Staudte and Crocker, 2011). We therefore looked at the 1 second before each utterance, for all utterances (N = 582), and the gaze proportions around the area of the object(s) that were about to be referred to. We compared the speaker's gaze to the referent objects with the listeners' gaze, using paired sample t-tests. The speaker's gaze (M = 49.72, Std. Error = 1.52) was different from the listeners' gaze (M = 28.53, Std. Error = 1.18) [t = 12.928, p = 0.001]. No significant difference was found for mutual gaze during the time before the referring expression [t = -1.421, p = 0.156].
There was also a significant difference between the proportional gaze of the speaker to the referent object (M = 49.72, Std. Error = 1.52) and the gaze to other objects (M = 33.07, Std. Error = 1.37) [t = 6.154, p = 0.001].
Finally, we compared the proportional gaze of the speaker to the referent object during the utterance and in the 1 second period before the utterance; there was, however, no significant difference [t = -1.854, p = 0.064]. There was a difference, as expected, for the listeners' gaze during (M = 39.45, Std. Error = 1.11) and before (M = 28.53, Std. Error = 1.18) the utterance [t = 9.745, p = 0.001].

6.3. Eye gaze timing
Further, we investigated when each interlocutor turned their gaze to the object that was referred to verbally. In 450 out of the 582 cases, the speaker was already looking at the object in a time window of 1 second prior to their verbal reference to it. On average, during the 1s window, they turned their gaze to the object 0.796s before uttering the referring expression [t = 83.157, p = 0.001], which is supported by the literature.
At least one of the listeners looked at the referent objects during the speaker's utterance in 537 out of the 582 cases. On average, they started looking at the proximity area of the referent object 344.5ms after the speaker started uttering the referring expression [t = 19.355, p = 0.001].

Figure 5: Mean proportional gaze per referring expression for speaker and listeners. a) The speaker's gaze during the time of the reference. b) The combined listeners' gaze during the speaker's referring expression. c) The speaker's gaze during the 1 second period before the reference. d) The combined listeners' gaze during the 1 second period before the speaker's referring expression.

6.4. Mutual gaze and joint attention
We extracted all occasions where the group was jointly attending to the area around the referent object, but also the occasions where only the speaker, only the listeners, or neither looked at the object (Table 1).

N = 582          BU [-1, 0]   DU [0, t]
None                  58           21
Both                 339          479
Speaker only         111           24
Listeners only        74           58

Table 1: Joint attention to the referent object area before (BU) and during (DU) the speaker's references.

During the preliminary analysis of the corpus, we noticed that in many cases at least one of the listeners was already looking at the area around the referent object before the speaker would utter a referring expression (Figure 5). It was our intuition that the salient object was already in the group's attention beforehand. We looked at 1s before the utterance and automatically extracted these cases, as can be seen in Table 2.

N = 582          BU [-1s]
None                 163
Both                 168
Speaker only         155
Listeners only        96

Table 2: Count of occasions where the referent object had already been in the focus of attention (1s before the speaker's referring expression).

Finally, we looked at the occasions where the interlocutors established mutual gaze. Table 3 shows cases where the speaker looked at one of the listeners, or where at least one of the listeners looked at the speaker, during and before referring expressions. Mutual gaze indicates that the speaker and one of the listeners looked at each other, and as can be seen, this is very rare during or before referring expressions.

N = 582                BU [-1, 0]   DU [0, t]
Mutual gaze                 10           20
Speaker at Listeners        52           72
Listeners at Speaker        80          143

Table 3: Count of occasions where the speaker looked at the listeners, the listeners looked at the speaker, and mutual gaze, during and before the speaker's referring expressions.

7. Discussion
The presented corpus contains recordings of a variety of multimodal data, is processed and annotated, and provides researchers with the possibility to explore multimodal multiparty turn-taking behaviours in situated interaction. While its strength lies in the rich annotation and the variation in conversational dynamics, it also has some limitations.
One of the limitations is the granularity of the eye-gaze annotation. Even though a high-end eye-gaze tracking system was used, it was not always possible to disambiguate which of the potential objects were being gazed at. Given a visual angle threshold, we are provided with a certain confidence measure, defined by the angle difference to each object, towards identifying the objects of attention. That limited our analysis to the area around the object, rather than a precise measure of the object itself. Another limitation is that we did not analyse the object movements or the participants' pointing gestures. These might explain some of the visual attention cases prior to the verbal referring expressions.
Moreover, the use of the eye tracking glasses and motion capture equipment had the disadvantage of being quite intrusive. Participants complained about fatigue at the end of the interactions, and it was also qualitatively observed that their gaze-head movement coordination changed once wearing the glasses.
On most occasions, the listeners and the speakers were looking at the area of the referent object before the referring expression was uttered, which could potentially mean that the object was already in the group's attention. In some cases, the listeners' visual attention was brought to the object area by referring language. Similarly to the cited literature, this potentially shows that gaze gives indications of the salient objects in a group's discussion, and that it can be used for reference resolution. It is also our intuition that objects that attract higher fixation density during referring expressions are considered more salient and can potentially resolve the references.
There are a few cases where neither the speaker nor the listeners looked at the referent objects; in such cases, text saliency algorithms (Evangelopoulos et al., 2009) or other multimodal cues such as pointing gestures (Lücking et al., 2015) could be combined to resolve the reference to the salient objects.
Typically, speakers direct the listeners' attention using verbal and non-verbal cues, and listeners often read the speaker's visual attention during referring expressions to get indications of the referent objects. As in the literature, we found that speakers and listeners gazed at each other a lot during the references to establish grounding on the referent objects. In very few cases, however, did they also establish mutual gaze (looking at each other at the same time) during those references.

8. Conclusions
The current paper presents a corpus of multi-party situated interaction. It is fully transcribed and automatically annotated for eye-gaze, gestures and spoken language. Moreover, it features an automatic eye-gaze annotation method where the participant's gaze is resolved in 3D space; a visualisation tool is also provided to qualitatively examine parts of, or the whole, corpus in terms of conversational dynamics, turn-taking and reference resolution. We annotated object references and investigated the proportional gaze from both the perspective of the speaker and the listeners. Finally, we quantitatively described the data of the corpus and gave further indications on how the corpus can be useful to the research community. Both the annotated corpus and visualisations are available at https://www.kth.se/profile/diko/page/material.

9. Acknowledgements
The authors would like to thank Joseph Mendelson for moderating the sessions and helping with the recordings of this corpus. We would also like to acknowledge the support from the EU Horizon 2020 project BabyRobot (687831) and the Swedish Foundation for Strategic Research project FACT (GMT14-0082).

10. Bibliographical References

Admoni, H. and Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction, 6(1):25–63.

Bohus, D. and Horvitz, E. (2009). Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics.

Bohus, D. and Horvitz, E. (2010). Facilitating multiparty dialog with gaze, gesture, and speech. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, page 5. ACM.

Bolt, R. A. (1980). "Put-that-there": Voice and gesture at the graphics interface, volume 14. ACM.

Borji, A. and Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207.

Bruce, N. D. and Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5–5.

Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation, 41(2):181–190.

Evangelopoulos, G., Zlatintsi, A., Skoumas, G., Rapantzikos, K., Potamianos, A., Maragos, P., and Avrithis, Y. (2009). Video event detection and summarization using audio, visual and text saliency. In Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009), IEEE International Conference on, pages 3553–3556. IEEE.

Georgeton, L. and Meunier, C. (2015). Spontaneous speech production by dysarthric and healthy speakers: Temporal organisation and speaking rate. In 18th International Congress of Phonetic Sciences, Glasgow, UK.

Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1):26.

Gross, S., Krenn, B., and Scheutz, M. (2017). The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 189–196. ACM.

Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Hung, H. and Chittaranjan, G. (2010). The Idiap Wolf corpus: exploring group behaviour in a competitive role-playing game. In Proceedings of the 18th ACM International Conference on Multimedia, pages 879–882. ACM.

Kennington, C., Kousidis, S., and Schlangen, D. (2013). Interpreting situated dialogue utterances: an update model that uses speech, gaze, and gesture information. In Proceedings of SIGdial 2013.

Lücking, A., Bergmann, K., Hahn, F., Kopp, S., and Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 Workshop: Multimodal Corpora – Advances in Capturing, Coding and Analyzing Multimodality.

Lücking, A., Pfeiffer, T., and Rieser, H. (2015). Pointing and reference reconsidered. Journal of Pragmatics, 77:56–79.

Meena, R., Skantze, G., and Gustafson, J. (2014). Data-driven models for timing feedback responses in a map task dialogue system. Computer Speech & Language, 28(4):903–922.

Mehlmann, G., Häring, M., Janowski, K., Baur, T., Gebhard, P., and André, E. (2014). Exploring a model of gaze for grounding in multimodal HRI. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 247–254. ACM.

Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S. M., Tyagi, A., Casas, J. R., Turmo, J., Cristoforetti, L., Tobia, F., et al. (2007). The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. Language Resources and Evaluation, 41(3-4):389–407.

Oertel, C., Cummins, F., Edlund, J., Wagner, P., and Campbell, N. (2013). D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 7(1-2):19–28.

Oertel, C., Funes Mora, K. A., Sheikhi, S., Odobez, J.-M., and Gustafson, J. (2014). Who will get the grant? A multimodal corpus for the analysis of conversational behaviours in group interviews. In Proceedings of the 2014 Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, pages 27–32. ACM.

Prasov, Z. and Chai, J. Y. (2008). What's in a gaze? The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 20–29. ACM.

Prasov, Z. and Chai, J. Y. (2010). Fusing eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 471–481. Association for Computational Linguistics.

Rayner, K. (1995). Eye movements and cognitive processes in reading, visual search, and scene perception. In Studies in Visual Information Processing, volume 6, pages 3–22. Elsevier.

Schlangen, D. and Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710–718. Association for Computational Linguistics.

Sheikhi, S. and Odobez, J.-M. (2012). Recognizing the visual focus of attention for human robot interaction. In International Workshop on Human Behavior Understanding, pages 99–112. Springer.

Skantze, G., Johansson, M., and Beskow, J. (2015). Exploring turn-taking cues in multi-party human-robot discussions about objects. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 67–74. ACM.

Staudte, M. and Crocker, M. W. (2011). Investigating joint attention mechanisms through spoken human-robot interaction. Cognition, 120(2):268–291.

Stefanov, K. and Beskow, J. (2016). A multi-party multimodal dataset for focus of visual attention in human-human and human-robot interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 of May. ELRA.

Sziklai, G. (1956). Some studies in the speed of visual perception. IRE Transactions on Information Theory, 2(3):125–128.

Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1):1–191.

127

  • Introduction
  • Background
    • Multiparty multimodal corpora
    • Social eye-gaze
    • Multimodal reference resolution
      • Data collection
        • Scenario
        • Participants
        • Corpus
        • Experimental setup
        • Motion capture
        • Eye gaze
        • Speech and language
        • Facial expressions
        • Application
        • Subjective measures
        • Calibration
          • Annotations
          • Data processing
            • Data synchronisation
            • Visualisation and visual angle threshold
            • Automatic annotation of gaze data
              • Results
                • Eye gaze during referring expressions
                • Eye gaze before referring expressions
                • Eye gaze timing
                • Mutual gaze and joint attention
                  • Discussion
                  • Conclusions
                  • Acknowledgements
                  • Bibliographical References
Page 2: A Multimodal Corpus for Mutual Gaze and Joint …of-the-art in multimodal multiparty corpora and relevant works in various types social eye gaze. We further describe our process in

2 Background21 Multiparty multimodal corporaOver the last decade more and more multimodal multi-party corpora have been created such as the ones describedin (Carletta 2007 Mostefa et al 2007 Oertel et al 2014Hung and Chittaranjan 2010 Oertel et al 2013 Stefanovand Beskow 2016) (Carletta 2007) and (Mostefa et al2007) fall into the category of meeting corpora (Hung andChittaranjan 2010) and (Stefanov and Beskow 2016) areexamples of games corpora and (Oertel et al 2014) a jobinterviewing corpus The corpus from (Oertel et al 2013)in contrast to the corpora listed above tries to escape the labenvironment and gathers data of multi-party interactions rdquointhe wildrdquoIn (Carletta 2007) (Mostefa et al 2007) and (Oertel et al2013) the visual focus of interlocutors is divided ie thereare stretches in the corpus in which interlocutorsrsquo main fo-cus of attention is on each other and there are stretchesin the corpus where they mainly focus on eg the whiteboard or sheets of papers which are laying in front of themHowever with the technical set-up used at the recordingsat the time it was hard to infer the exact points the partic-ipant were looking In (Oertel et al 2014) and (Hung andChittaranjan 2010) participantsrsquo visual focus of attentionis solely focused on the other participants (no other objectsare present during the recordings)Another example is (Stefanov and Beskow 2016) who intheir corpus study the visual focus of attention of groupsof participants For this they recorded groups of three par-ticipants while they were sorting cards They also recordedgroups of three participants discussing their travel expe-riences without any objects present that might distract thevisual focus of attentionFinally while (Lucking et al 2010) is not a multiparty cor-pus it should be mentioned here as it is similar to the cor-pus described in this paper particularly well suited for thestudy of referring expressions In terms of experimentalsetup the corpus recording described in this paper is mostsimilar to (Stefanov and Beskow 2016)

22 Social eye-gazeSocial eye gaze refers to the communicative cues of eyecontact between humans and is usually referred to by 4main types (Admoni and Scassellati 2017) 1 Mutual gazewhere both interlocutorsrsquo attention is directed at each other2) Joint attention where both interlocutors focus their at-tention on the same object or location 3) Referential gazewhich is directed to an object or location and often comestogether with referring language and 4) Gaze aversions thattypically avert from the main direction of gaze - ie theinterlocutorrsquos faceJoint attention is of particular importance for communica-tion It provides participants with the possibility to interpretand predict each otherrsquos actions and react accordingly Inthe current corpus for example the modeling of joint atten-tion is of particular importance as it provides participantswith the possibility to track the interlocutorsrsquo current focusof attention in the discourse (Grosz and Sidner 1986) Acommon quality of joint attention is that it may start with

mutual gaze to establish where is the attention of the inter-locutor and end towards the referential gaze direction to themost salient object (Admoni and Scassellati 2017)Also as participants are very likely to be focused more onthe display than each other joint attention will be crucialto discern whether they are paying attention to each otherModelling of joint attention is also crucial when developingfine-grained models of dialogue processing (Schlangen andSkantze 2009) which for example makes it possible fora dialogue system to give more timely feedback (Meenaet al 2014) With regards to multi-party interaction thereare also recent studies which model the situation in whichthe interaction takes place in order to manage several userstalking to the system at the same time (Bohus and Horvitz2010) and references to objects in the shared visual scene(Kennington et al 2013)

23 Multimodal reference resolutionA reference is typically a symbolic representation of a lin-guistic expression to a specific object or abstraction Durin-ing early attempts in verbal communications between hu-mans and machines researchers used rule-based systems todisambiguate words and map them to referent objects in vir-tual worlds (Winograd 1972) Research has also focused indisambiguating language using multimodal cues startingin the late 70s with Richard Boltrsquos rdquoPut-That-Thererdquo (Bolt1980) to recent approaches using eye-gaze (Mehlmann etal 2014 Prasov and Chai 2008 Prasov and Chai 2010)head pose (Skantze et al 2015) and pointing gestures(Lucking et al 2015) Gross et al recently explored thevariability and interplay of different multimodal cues in ref-erence resolution (Gross et al 2017)Eye gaze and head direction have been shown to be goodindicators of object saliency in human interactions re-searchers have developed computational methods to con-struct saliency maps and identify humansrsquo visual attention(Bruce and Tsotsos 2009 Sziklai 1956 Borji and Itti2013 Sheikhi and Odobez 2012) Typically speakers di-rect the listenersrsquo attention to objects using verbal and nonverbal cues and listeners often read the speakerrsquos visual at-tention during referring expressions to get indications onthe referent objects

3. Data collection
3.1. Scenario
We optimised for variation in conversational dynamics by dividing the corpus recordings into two conditions. In the first condition, the moderator facilitated a discussion about the participants' experience on the topic of sharing an apartment. In this condition no distracting objects were present and the screen of the large display was off. In the second condition, the participants were asked to collaborate on decorating an apartment (Figure 2) that they should imagine they would be moving into together. They were given an empty flat in which they had to decide where to place furniture, which they could buy from the store. The stores were provided as extra screens in the application that they could go through to get new pieces of furniture. They were given a limited budget, which fostered their discussions and decisions on what objects to choose. Given the variety of available objects, they would need to compromise on their choices and collectively decide where to place objects.

Figure 2: The application running on the large display. The objects were "bought" from the furniture shop, which had a variety of available furniture.

3.2. Participants
We collected data from 30 participants, in a total of 15 interactions. Our moderator (a native US-English speaker) was present in all 15 sessions, facilitating the structure of the interactions and instructing participants on their role in completing the task. The moderator always sat at the same side of the table and the two participants were sitting across from the moderator (Figure 1). The mean age of our participants was 25.7; 11 were female and 19 were male, and the majority of them were students or researchers at KTH Royal Institute of Technology in Stockholm, Sweden.

3.3. Corpus
Our corpus consists of a total of 15 hours of recordings of triadic interactions. All recordings (roughly one hour each) contain data from various sensors capturing motion, eye gaze, audio and video streams. Each session follows the same structure: the moderator welcomes the participants with a brief discussion on the topic of moving into a flat with someone, and thereafter introduces the setup and the scenario of the two participants planning their move into the same apartment using the large display. During the interactions we collected a set of multimodal data, a variety of input modalities that we combined to get information on participants' decision making and intentions.
Out of the 15 sessions, 2 had no successful eye gaze calibration and were discarded. One session had synchronisation issues in the screen application and was also discarded. Lastly, one session had large gaps in the gaze data and was discarded as well. The remaining 11 sessions of aggregated and processed data are further described in the rest of the paper and are available at https://www.kth.se/profile/diko/page/material.

3.4. Experimental setup
Participants were situated around a table with a large touch display. On the display ran an application we developed to facilitate their planning of the moving-in-together scenario. We gave participants a pair of gloves with reflective markers and eye tracking glasses (Tobii Glasses 2¹), which also had reflective markers to track their position in space. The room was surrounded by 17 motion capture cameras, positioned in such a way that both gloves and glasses were always in the cameras' sight.
The moderator was also wearing eye tracking glasses. Since our aim is to develop models for a robot without hands, the moderator was not wearing gloves with markers and was instructed to avoid using hand gestures. There were two cameras placed on the table capturing facial expressions, and a regular video camera at a distance recording the full interaction for annotation purposes. On the glasses we also placed lavalier microphones (one per participant), in such a way that we captured each subject's voice separately from the rest of the subjects' speech and with consistent volume.
The participants collaborated in the given scenario for 1 hour, during which they discussed and negotiated to form a common solution in the apartment planning. At the end of the recording we asked participants to fill in a questionnaire on their perception of how the discussion went, their negotiations with the other participant and, finally, some demographic information. All participants were reimbursed with a cinema ticket.

3.5. Motion capture
We used an OptiTrack motion capture system² to collect motion data from the subjects. The 17 motion capture cameras collected motion from the reflective markers at 120 frames per second. To identify rigid objects in 3D space, we placed 4 markers per object of interest (glasses, gloves, display) and captured position (x, y, z) and rotation (x, y, z, w) for each rigid object. While 3 markers are sufficient for capturing the position of a single rigid body, we placed a 4th marker on each object for robustness. That way, if one of the markers was not captured, we would still identify the rigid object in space.
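To make the redundancy argument concrete, the following is a minimal sketch (our own illustration, not the OptiTrack pipeline) of how a rigid body tracked with four markers can still be located when one marker is occluded; the marker coordinates are invented.

```python
import numpy as np

def rigid_body_centre(markers):
    """Return the centroid of the visible markers, or None if fewer
    than 3 of the 4 markers were captured (pose would be ambiguous).
    Markers are (x, y, z) in metres; None marks a missed marker."""
    visible = [np.asarray(m, dtype=float) for m in markers if m is not None]
    if len(visible) < 3:
        return None
    return np.mean(visible, axis=0)

# Example frame for a pair of glasses: one of the 4 markers is occluded,
# but the rigid body can still be identified and positioned.
glasses_markers = [(0.10, 1.42, 0.55), (0.14, 1.43, 0.55), (0.12, 1.45, 0.58), None]
print(rigid_body_centre(glasses_markers))
```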

3.6. Eye gaze
We were interested in collecting eye gaze data for each participant in order to model referential gaze, joint attention, mutual gaze and gaze aversion in multiparty situated dialogue (figure 3). In order to capture eye gaze in 3D space, we used eye tracking glasses with motion capture markers. This made it possible to accurately identify when a participant's gaze trajectory intersected objects or the other interlocutors. It also made it possible to capture gaze aversion, i.e. when participants gazed away while speaking or listening to one of the participants.
Gaze samples were recorded at 50 Hz, and the data was captured by tracking the subjects' pupil movements and pupil size, together with a video from their point of reference. We placed reflective markers on each pair of glasses and were therefore able to identify the gaze trajectory in x and y in each participant's point of reference, and then resolve it to world coordinates using the glasses' position in 3D space. The glasses, using triangulation from both eyes' positions, also provided a z value, which refers to the point where the trajectories from the two eyes meet. That point in space (x, y, z) was used to resolve the eye gaze trajectory in 3D space.

¹ http://www.tobiipro.com
² http://optitrack.com
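As an illustration of this resolution step, the sketch below (function names and values are our own; it assumes the rigid-body pose is given as a position plus an (x, y, z, w) quaternion, as captured by the motion capture system) transforms a gaze point from the glasses' local frame into world coordinates and derives a unit gaze ray.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def gaze_point_to_world(gaze_local, glasses_pos, glasses_quat_xyzw):
    """gaze_local: (x, y, z) gaze point in the glasses' frame (metres),
    where z comes from the two-eye vergence estimate.
    glasses_pos / glasses_quat_xyzw: rigid-body pose from motion capture."""
    rot = Rotation.from_quat(glasses_quat_xyzw)       # scalar-last (x, y, z, w)
    return rot.apply(np.asarray(gaze_local, dtype=float)) + np.asarray(glasses_pos, dtype=float)

def gaze_ray(gaze_local, glasses_pos, glasses_quat_xyzw):
    """Unit gaze direction in world coordinates, from the glasses' origin
    towards the resolved 3D gaze point."""
    target = gaze_point_to_world(gaze_local, glasses_pos, glasses_quat_xyzw)
    origin = np.asarray(glasses_pos, dtype=float)
    direction = target - origin
    return origin, direction / np.linalg.norm(direction)

# Invented example: glasses at head height, wearer looking down-forward.
origin, direction = gaze_ray(
    gaze_local=(0.05, -0.20, 0.60),
    glasses_pos=(0.12, 1.43, 0.56),
    glasses_quat_xyzw=(0.0, 0.0, 0.0, 1.0),
)
print(origin, direction)
```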

Figure 3: By combining synchronised data from motion capture and eye tracking glasses, we captured the participants' eye gaze in 3D space, but also head pose and gestures.

3.7. Speech and language
We recorded audio from each participant using channel-separated lavalier microphones attached to the eye tracking glasses. Each microphone captured speech that we later transcribed automatically using speech recognition, resolving the spoken utterances to text. We used a voice activity detection filter on all channels to separate the captured speech from that of the other speakers. We further used Tensorflow's³ neural network parser to form syntactic representations of the automatically transcribed natural language text.
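The corpus used its own voice activity detection filter; the following is only a deliberately simple, energy-based sketch of the idea of keeping each channel's own speech, with an arbitrary illustrative threshold and synthetic audio.

```python
import numpy as np

def voice_activity(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return one boolean per frame: True where the channel's RMS energy
    exceeds the threshold, i.e. where the wearer of that microphone is
    likely speaking louder than the cross-talk."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms > threshold

# One second of silence followed by one second of synthetic "speech".
audio = np.concatenate([np.zeros(16000), 0.1 * np.random.randn(16000)])
print(voice_activity(audio, sample_rate=16000)[:5], "...")
```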

3.8. Facial expressions
In order to capture facial expressions, we placed GoPro cameras on the table in front of the participants and the moderator. We also recorded each session from a camera placed on a tripod further into the room, to capture the interaction as a whole and for later use in annotating the corpus.

3.9. Application
We built an application to enable the interaction with several virtual objects (figure 2). The application consisted of two main views: a floor plan and a furniture shop. The floor plan, displayed on the large multi-touch display, was used by the participants during their interactions as the main area to manipulate the objects, while the shops were used to collect new objects.
The shop screens were divided by room categories, such as kitchen or living room. They naturally induced deictic expressions, as the participants referred to the objects during their negotiations using both language and referential gaze (e.g. "it", "the desk", "the bed"). The floor plan, on the contrary, was the main display where they could manipulate objects by placing them in rooms, and where they used referring language for these virtual locations (e.g. "here", "there", "my room").
The floor plan view also displayed the apartment scale and the budget the participants could spend on objects. The budget was limited: after initial pilots it was set to a value that would be enough to satisfy both participants' furniture selections, but limited enough to foster negotiations on what objects they would select. They were also allowed to choose only one of each item, in order to foster negotiations further.
The application maintained event-based logging at 5 fps to keep track of the interaction flow, object movements and allocations. Each event carried time stamps and relevant application state, as well as position and rotation data for all objects. By placing markers in the corners of the screen, the display could be placed into the motion capture coordinate system. This allowed us to get the virtual object events in 3D space and capture the participants' visual attention to these objects.

³ https://www.tensorflow.org/versions/r0.12/tutorials/syntaxnet
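As a sketch of how on-screen content can be mapped into the motion-capture frame via the tracked corner markers, the following bilinearly interpolates a normalised screen position between four corner positions; this is our own formulation with hypothetical corner coordinates, not the corpus code.

```python
import numpy as np

def screen_to_world(u, v, top_left, top_right, bottom_left, bottom_right):
    """Map normalised screen coordinates (u, v in [0, 1]) to world
    coordinates by bilinear interpolation between the four tracked
    corner markers of the display."""
    tl, tr = np.asarray(top_left, float), np.asarray(top_right, float)
    bl, br = np.asarray(bottom_left, float), np.asarray(bottom_right, float)
    top = (1 - u) * tl + u * tr          # interpolate along the top edge
    bottom = (1 - u) * bl + u * br       # interpolate along the bottom edge
    return (1 - v) * top + v * bottom    # then between the two edges

# Hypothetical corner positions (metres) of a table-mounted display.
corners = dict(
    top_left=(-0.5, 0.75, 0.3), top_right=(0.5, 0.75, 0.3),
    bottom_left=(-0.5, 0.75, 0.9), bottom_right=(0.5, 0.75, 0.9),
)
print(screen_to_world(0.25, 0.5, **corners))   # an object placed left of centre
```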

3.10. Subjective measures
At the end of each session we gave participants a questionnaire to measure their impression of how the discussion went and how well they thought they had collaborated when decorating the apartment. We used these measures to assess dominance as well as collaborative behaviour and personality. All participants were also asked to fill in personality tests before coming to the lab; the tests included introversion and extroversion measures (Goldberg, 1992).

3.11. Calibration
The sensors we used required calibration in order to successfully capture motion and eye movements. We calibrated the positioning of all 17 cameras at the beginning of each recording, while the eye tracking glasses required calibration for each subject separately. This is due to the fact that each participant's eye positions vary, as does how their eyes move during saccades.

4. Annotations
We annotated referring expressions to objects on the display by looking at what object the speaker intended to refer to. Speakers typically drew their interlocutors' attention to objects using deictic expressions, referring language and referential gaze, as well as mutual gaze. In every dialogue there was a set of references to objects not included in our simulated environment ("I also have a fireplace in my flat but I do not use it a lot"). However, we only annotated references that can be resolved to objects existing in our simulated environment, i.e. the application on the touch display. The references we can resolve in this environment are therefore only a subset of all possible references in the dialogue.
For the annotations we used the videos of the interactions together with the ASR transcriptions and the current state of the app (visible objects or current screen on the large display). We defined each referring expression by looking at the time of the speaker's utterance. The timing was taken from the ASR transcriptions, which were synchronised with the gaze and gesture data. Utterances were split into inter-pausal units (IPUs), separated by silent pauses longer than 250 ms (Georgeton and Meunier, 2015).
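A minimal sketch of the IPU segmentation step, assuming the ASR provides word timings as (token, start, end) triples; the word list below is an invented example, not corpus data.

```python
def split_into_ipus(words, pause_threshold=0.25):
    """words: list of (token, start_s, end_s) from the ASR, in time order.
    Returns a list of IPUs, each a list of tokens, split wherever the
    silence between two consecutive words exceeds the pause threshold."""
    ipus, current = [], []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            ipus.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        ipus.append(current)
    return ipus

words = [("put", 0.00, 0.20), ("the", 0.22, 0.30), ("sofa", 0.32, 0.70),
         ("here", 1.10, 1.35)]
print(split_into_ipus(words))   # [['put', 'the', 'sofa'], ['here']]
```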


The ASR transcriptions were used as input to the Tensorflow syntactic parser, from which we extracted the part-of-speech (POS) tags. Similarly to (Gross et al., 2017), as linguistic indicators we used full or elliptic noun-phrases such as "the sofa", "sofa", "bedroom", pronouns such as "it", "this" and "that", possessive pronouns such as "my" and "mine", and spatial indexicals such as "here" or "there". The ID of the salient object or location of reference was identified and saved in the annotations. In some cases there was only one object referred to by the speaker, but there were also cases where the speaker referred to more than one object in one utterance (e.g. "the table and the chairs"). The roles of speaker and listeners varied in each expression. In total we annotated 766 referring expressions from 5 sessions, roughly 5 hours of recordings, which was about one third of our corpus.
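The sketch below illustrates this linguistic-indicator filter, assuming Penn Treebank-style POS tags; the function name and the tagged utterance are our own illustrative choices.

```python
SPATIAL_INDEXICALS = {"here", "there"}
PRONOUNS = {"it", "this", "that"}
POSSESSIVES = {"my", "mine"}

def referring_candidates(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag). Returns the tokens that
    match the indicator categories (nouns, selected pronouns, possessive
    pronouns, spatial indexicals)."""
    hits = []
    for token, tag in tagged_tokens:
        word = token.lower()
        if tag.startswith("NN"):                  # full or elliptic noun phrases
            hits.append(token)
        elif word in PRONOUNS or word in POSSESSIVES or word in SPATIAL_INDEXICALS:
            hits.append(token)
    return hits

utterance = [("we", "PRP"), ("put", "VBP"), ("the", "DT"),
             ("sofa", "NN"), ("there", "RB")]
print(referring_candidates(utterance))            # ['sofa', 'there']
```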

5. Data processing
The various modalities were post-processed and analysed; in particular, combined eye gaze and motion capture provided an accurate estimation of gaze pointing in 3D space. We used a single world coordinate system in metres for both the eye gaze and motion capture input streams, and we identified the gaze trajectories, head pose and hand gestures. We started with a low-level multimodal fusion of the input sensor streams and further aggregated the data, together with speech and language, into high-level sequences of each participant's speech and visual attention. Each data point had automatic speech transcriptions, syntactic parses of these transcriptions, gesture data in the form of moving objects, and eye gaze in the form of a target object or person.

5.1. Data synchronisation
We used sound pulses from each device to sync all signals in a session. The motion capture system sent a pulse on each frame captured, while the eye tracking glasses sent 3 pulses every 10 seconds. The application also sent a pulse every 10 seconds. All sound signals were collected on the same 12-channel sound card, which would then mark the reference point in time for each session.
Audio signals were also captured on the sound card; we were therefore able to identify a reference point that would set the start time for each recording. That was the last of the sensors to start, which was one of the eye tracking glasses. Apart from the sound pulses, we used a clapperboard with reflective markers on its closed position, which would mark the start and the end of each session. We used it to sync the video signals and as a safety point in case one of the sound pulses failed to start. Since the application on the display sent sync signals, we used it to mark the separation in time of the two experimental conditions. The first one (free discussion) ended when the app started, which led into the second condition (task-oriented dialogue).
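A sketch of the pulse-based alignment idea: detect the first sync pulse on each sound-card channel with a plain amplitude threshold and compute per-channel offsets relative to a reference channel. The detection method, names and channel data below are illustrative, not the corpus implementation.

```python
import numpy as np

def first_pulse_sample(channel, threshold=0.5):
    """Index of the first sample whose absolute amplitude exceeds the threshold."""
    above = np.flatnonzero(np.abs(channel) > threshold)
    return int(above[0]) if above.size else None

def channel_offsets_seconds(channels, sample_rate, reference=0):
    """Offset of each channel's first pulse relative to the reference channel."""
    onsets = [first_pulse_sample(c) for c in channels]
    ref = onsets[reference]
    return [(o - ref) / sample_rate for o in onsets]

rate = 48000
mocap = np.zeros(rate); mocap[1000] = 1.0           # pulse early in the file
glasses = np.zeros(rate); glasses[25000] = 1.0      # pulse half a second later
print(channel_offsets_seconds([mocap, glasses], rate))   # [0.0, 0.5]
```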

5.2. Visualisation and visual angle threshold
As illustrated in figure 3, we calculated the eye gaze trajectories by combining motion capture data and eye tracking data per frame. We implemented a visualisation tool⁴ in WebGL to verify and evaluate the calculated eye gaze in 3D space. Using this tool we were able to qualitatively investigate sections of the participants' multimodal turn-taking behaviour by visualising their position and gaze at objects or persons, along with their transcribed speech and audio in wav format. We also used this tool to empirically define the visual angle threshold for each eye tracker with respect to the accuracy of gaze targets. During pilots we asked participants to look at different objects, ensuring there were angle variations, in order to identify the threshold on the angle between the gaze vector and the dislocation vector (between the glasses and the salient object).

⁴ Available at https://www.kth.se/profile/diko/page/material

5.3. Automatic annotation of gaze data
We extracted eye-gaze data for all participants and all sessions and noticed that there were quite often gaps between gaze data points, in some sessions more than in others. We applied smoothing to eliminate outliers and interpolated gaps of up to 12 frames (100 ms on our 120 fps data). Typically an eye fixation is 250 ms but no shorter than 100 ms (Rayner, 1995), which defined our smoothing strategy. The data was then filtered in the same time window to only provide data points with fixations and not include saccades or eye blinks.
There were, however, gaps longer than 12 frames, caused by lack of data from the eye trackers. In such cases the eye trackers had no information on the eyes' positions, which means that there was no knowledge of whether a speaker or listener looked at the referent object. For these cases of missing gaze data, we went through the manually annotated referring expressions and checked the relevant frames of the gaze data. If at least one of the participants had no gaze data, we discarded the relevant referring expression from our analysis. After cleaning those cases we had 582 expressions remaining out of the 766 initially annotated⁵.
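A sketch of the gap-handling step on a 120 fps gaze track, assuming one coordinate per frame in a pandas Series: runs of missing samples no longer than the given limit are linearly interpolated, and longer runs are left missing so the affected referring expressions can later be discarded. The series and the small limit in the demo are illustrative.

```python
import numpy as np
import pandas as pd

def fill_short_gaps(series, max_gap_frames=12):
    """Linearly interpolate NaN runs no longer than max_gap_frames;
    longer runs (and leading/trailing runs) are left untouched."""
    s = series.copy()
    is_na = s.isna()
    run_id = (is_na != is_na.shift()).cumsum()        # label consecutive runs
    run_len = is_na.groupby(run_id).transform("sum")  # length of each NaN run
    fillable = is_na & (run_len <= max_gap_frames)
    interpolated = s.interpolate(method="linear", limit_area="inside")
    s[fillable] = interpolated[fillable]
    return s

track = pd.Series([0.0, 0.1, np.nan, np.nan, 0.4,
                   np.nan, np.nan, np.nan, 0.8, 0.9])
print(fill_short_gaps(track, max_gap_frames=2).tolist())
# the 2-frame gap is filled (~0.2, ~0.3); the 3-frame gap stays NaN
```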

The average gaze data loss was 40.8%; the session with the maximum gaze error rate was at 71.5%, while the minimum was at 26.9%.
During a referring expression, participants' visual attention spanned a) a variety of visible objects on the screen, b) their interlocutors, or c) none of the above, which we assumed in this corpus to be gaze aversion. The prominent objects of visual attention were identified by calculating each participant's visual angle α between their gaze vector g and the dislocation vector o, for every visible object on the screen and for every frame:

\alpha_{ij} = \arccos\left(\frac{\vec{g}_{ij} \cdot \vec{o}_{ij}}{|\vec{g}_{ij}|\,|\vec{o}_{ij}|}\right) \qquad (1)

Similarly, we approximated each person's head with a sphere of radius 0.2 m and automatically annotated all frames where the gaze vector intersected the sphere as visual attention towards that person.
Finally, after filtering for eye fixations, we calculated the proportional gaze likelihood per object:

P(o_i \mid t_i) = \frac{c(o_i, t_i)}{c(t_i)} \qquad (2)

⁵ The 5 sessions chosen for annotation were the ones with the smallest percentage of gaze data loss.


For each object o_i we gathered the proportional gaze data points and counted the amount of time the object was gazed at during the time t of an utterance. As a gaze target we defined the area around any object that is within the visual angle threshold defined above. In many cases more than one object competed for saliency in the same gaze target area. We therefore defined joint attention not as gaze at the same group of objects by the interlocutors, but as gaze at the same area around the prominent object, given the visual angle threshold.
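A sketch of equations (1) and (2) in code; the vectors, the frame-to-target mapping and the 10-degree threshold are illustrative values of our own, not corpus settings.

```python
import numpy as np

def visual_angle(gaze_dir, eye_pos, object_pos):
    """Angle (radians) between the gaze vector and the dislocation vector
    from the glasses to the object, as in equation (1)."""
    g = np.asarray(gaze_dir, float)
    o = np.asarray(object_pos, float) - np.asarray(eye_pos, float)
    cos = np.dot(g, o) / (np.linalg.norm(g) * np.linalg.norm(o))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def proportional_gaze(frame_targets, utterance_frames):
    """Equation (2): fraction of an utterance's frames on which each object
    was the gaze target. frame_targets maps frame -> object id (or None)."""
    counts = {}
    for f in utterance_frames:
        target = frame_targets.get(f)
        if target is not None:
            counts[target] = counts.get(target, 0) + 1
    return {obj: c / len(utterance_frames) for obj, c in counts.items()}

angle = visual_angle(gaze_dir=(0.0, -0.5, 1.0), eye_pos=(0.0, 1.4, 0.0),
                     object_pos=(0.1, 0.8, 1.1))
print(np.degrees(angle) < 10.0)        # within an illustrative threshold?
targets = {0: "sofa", 1: "sofa", 2: None, 3: "table"}
print(proportional_gaze(targets, utterance_frames=range(4)))
```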

6. Results

In this section we describe gaze patterns as observed between and across the two conditions, and proportional gaze during and before referring expressions. In figure 4 we show the proportional eye-gaze of the moderator as well as the participants' eye gaze on the display. As can be observed, the participants spent more time gazing at the display than at the moderator or at each other, which is what we expected to observe during a task-oriented dialogue.

Figure 4: Proportional amount of participant eye-gaze per session on the display during the second condition (task-oriented dialogue).

For each utterance, a set of prominent objects was defined from the annotations. Given the gaze targets per participant, we calculated the proportional gaze of the speaker and of the listeners during the time of the utterance and during exactly 1 second before the utterance. Since all interactions were triadic, there were two listeners and one speaker at all times. To compare across all utterances, we took the mean proportional gaze of the two listeners on the area of the prominent object, defining them as the listener group. We then compared the gaze of the listener group to the speaker's gaze close to the referent objects. We also looked at the proportional combined gaze to other objects during the utterance (all other objects that were gazed at during the utterance), gaze at the speaker, and averted gaze. In figure 5, the blue colour refers to the proximity area of the referent object of the speaker's utterance, while orange refers to the gaze at all other known objects in the virtual environment combined. Grey is for the gaze at the listeners or the speaker during the utterance, and yellow for averted gaze.

6.1. Eye gaze during referring expressions
We looked at all references to objects (N = 582) and compared the means of the proportional gaze of the speaker and of the listeners on the proximity area of the referent objects, on the rest of the gazed objects, and towards each other. We conducted paired sample t-tests and found a significant difference between the speaker's gaze on the proximity area of the referent object (M = 46.69, StdError = 1.37) and the listeners' gaze on the area around the referent object (M = 39.45, StdError = 1.11), with [t = 4.942, p = 0.001]. There was, however, no significant difference between the speaker's gaze at the listeners and the gaze from the listeners at the speaker [t = -0.816, p = 0.415] (mutual gaze).
There was also a significant difference between the proportional gaze of the speaker on the area around the referent object (M = 46.69, StdError = 1.37) and the proportional gaze to other objects during the same utterances (M = 32.41, StdError = 1.25) [t = 5.887, p = 0.001]. However, no significant difference was found in the same comparison for the listeners' gaze [t = -0.767, p = 0.444].
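As an illustration of the kind of paired comparison reported here, the following runs a paired sample t-test on synthetic per-expression gaze proportions; the numbers are not corpus values.

```python
import numpy as np
from scipy import stats

# Synthetic per-expression proportions: speaker vs. mean listener gaze on
# the referent area, one pair per referring expression.
rng = np.random.default_rng(0)
n = 582
speaker_gaze = np.clip(rng.normal(0.47, 0.25, n), 0, 1)
listener_gaze = np.clip(rng.normal(0.39, 0.25, n), 0, 1)

t_stat, p_value = stats.ttest_rel(speaker_gaze, listener_gaze)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```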

6.2. Eye gaze before referring expressions
Previous studies have revealed that speakers typically look at referent objects about 800-1000 ms before mentioning them (Staudte and Crocker, 2011). We therefore looked at the 1 second before each utterance, for all utterances (N = 582), and at the gaze proportions on the area around the object(s) that were about to be referred to. We compared the speaker's gaze at the referent objects to the listeners' gaze using paired sample t-tests. The speaker's gaze (M = 49.72, StdError = 1.52) was different from the listeners' gaze (M = 28.53, StdError = 1.18) [t = 12.928, p = 0.001]. No significant difference was found for mutual gaze during the time before the referring expression [t = -1.421, p = 0.156].
There was also a significant difference between the proportional gaze of the speaker at the referent object (M = 49.72, StdError = 1.52) and the gaze at other objects (M = 33.07, StdError = 1.37) [t = 6.154, p = 0.001].
Finally, we compared the proportional gaze of the speaker at the referent object during the utterance and in the 1 second period before the utterance; there was, however, no significant difference [t = -1.854, p = 0.064]. There was a difference, as expected, in the listeners' gaze during (M = 39.45, StdError = 1.11) and before (M = 28.53, StdError = 1.18) the utterance [t = 9.745, p = 0.001].

6.3. Eye gaze timing
Further, we investigated when each interlocutor turned their gaze to the object that was referred to verbally. In 450 out of the 582 cases, the speaker was already looking at the object within a time window of 1 second prior to their verbal reference to it. On average, during that 1 s window, they turned their gaze to the object 0.796 s before uttering the referring expression [t = 83.157, p = 0.001], which is in line with the literature.
At least one of the listeners looked at the referent objects during the speaker's utterance in 537 out of the 582 cases.


Figure 5: Mean proportional gaze per referring expression for speaker and listeners. a) The speaker's gaze during the time of the reference. b) The combined listeners' gaze during the speaker's referring expression. c) The speaker's gaze during the 1 second period before the reference. d) The combined listeners' gaze during the 1 second period before the speaker's referring expression.

On average, they started looking at the proximity area of the referent object 344.5 ms after the speaker started uttering a referring expression [t = 19.355, p = 0.001].

6.4. Mutual gaze and joint attention
We extracted all occasions where the group was jointly attending to the area around the referent object, but also the occasions where only the speaker, only the listeners, or neither looked at the object (Table 1).

N = 582          BU [-1,0]    DU [0,t]
None             58           21
Both             339          479
Speaker only     111          24
Listeners only   74           58

Table 1: Joint attention to the referent object area before (BU) and during (DU) the speaker's references.

During the preliminary analysis of the corpus, we noticed that in many cases at least one of the listeners was already looking at the area around the referent object before the speaker uttered a referring expression (figure 5). Our intuition was that the salient object was already in the group's attention beforehand. We looked at 1 s before the utterance and automatically extracted these cases, as can be seen in Table 2.

N = 582          BU [-1s]
None             163
Both             168
Speaker only     155
Listeners only   96

Table 2: Count of occasions where the referent object had already been in the focus of attention (1 s before the speaker's referring expression).

Finally, we looked at the occasions where the interlocutors established mutual gaze. Table 3 shows the cases where the speaker looked at one of the listeners or where at least one of the listeners looked at the speaker, during and before referring expressions. Mutual gaze indicates that the speaker and one of the listeners looked at each other at the same time; as can be seen, this is very rare during or before referring expressions.

N = 582                BU [-1,0]    DU [0,t]
Mutual gaze            10           20
Speaker at Listeners   52           72
Listeners at Speaker   80           143

Table 3: Count of occasions where the speaker looked at the listeners, the listeners looked at the speaker, and mutual gaze, during and before the speaker's referring expressions.

7. Discussion
The presented corpus contains recordings of a variety of multimodal data, is processed and annotated, and provides researchers with the possibility to explore multimodal, multi-party turn-taking behaviours in situated interaction. While its strength lies in the rich annotation and the variation in conversational dynamics, it also has some limitations.
One of the limitations is the granularity of the eye-gaze annotation. Even though a high-end eye-gaze tracking system was used, it was not always possible to disambiguate which of the potential objects was being gazed at. Given a visual angle threshold, we obtain a certain confidence measure, defined by the angle difference to each object, for identifying the objects of attention. That limited our analysis to the area around the object rather than a precise measure of the object itself. Another limitation is that we did not analyse the object movements or the participants' pointing gestures. These might explain some of the visual attention cases prior to the verbal referring expressions.
Moreover, the use of the eye tracking glasses and motion capture equipment had the disadvantage of being quite intrusive. Participants complained about fatigue at the end of the interactions, and it was also qualitatively observed that their gaze-head movement coordination changed once they were wearing the glasses.
On most occasions the listeners and the speakers were looking at the area of the referent object before the referring expression was uttered, which could mean that the object was already in the group's attention. In some cases the listeners' visual attention was brought to the object area by the referring language. Similarly to the cited literature, this potentially shows that gaze gives indications of the salient objects in a group's discussion, and that it can be used for reference resolution. It is also our intuition that objects that attract higher fixation density during referring expressions are considered to be more salient and can potentially resolve the references.
There are a few cases where neither the speaker nor the listeners looked at the referent objects.


In such cases, text saliency algorithms (Evangelopoulos et al., 2009) or other multimodal cues, such as pointing gestures (Lücking et al., 2015), could be combined to resolve the reference to the salient objects.
Typically, speakers direct the listeners' attention using verbal and non-verbal cues, and listeners often read the speaker's visual attention during referring expressions to get indications of the referent objects. As in the literature, we found that speakers and listeners gazed at each other a lot during the references to establish grounding on the referent objects. In very few cases, however, did they also establish mutual gaze (looking at each other at the same time) during those references.

8. Conclusions
This paper presents a corpus of multi-party situated interaction. It is fully transcribed and automatically annotated for eye-gaze, gestures and spoken language. Moreover, it features an automatic eye-gaze annotation method where the participants' gaze is resolved in 3D space; a visualisation tool is also provided to qualitatively examine parts of, or the whole, corpus in terms of conversational dynamics, turn-taking and reference resolution. We annotated object references and investigated the proportional gaze from both the perspective of the speaker and of the listeners. Finally, we quantitatively described the data of the corpus and gave further indications of how the corpus can be useful to the research community. Both the annotated corpus and the visualisations are available at https://www.kth.se/profile/diko/page/material.

9. Acknowledgements
The authors would like to thank Joseph Mendelson for moderating the sessions and helping with the recordings of this corpus. We would also like to acknowledge the support from the EU Horizon 2020 project BabyRobot (687831) and the Swedish Foundation for Strategic Research project FACT (GMT14-0082).

10. Bibliographical References

Admoni, H. and Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction, 6(1):25–63.
Bohus, D. and Horvitz, E. (2009). Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics.
Bohus, D. and Horvitz, E. (2010). Facilitating multiparty dialog with gaze, gesture, and speech. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, page 5. ACM.
Bolt, R. A. (1980). "Put-that-there": Voice and gesture at the graphics interface, volume 14. ACM.
Borji, A. and Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207.
Bruce, N. D. and Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5–5.
Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation, 41(2):181–190.
Evangelopoulos, G., Zlatintsi, A., Skoumas, G., Rapantzikos, K., Potamianos, A., Maragos, P., and Avrithis, Y. (2009). Video event detection and summarization using audio, visual and text saliency. In Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009), IEEE International Conference on, pages 3553–3556. IEEE.
Georgeton, L. and Meunier, C. (2015). Spontaneous speech production by dysarthric and healthy speakers: Temporal organisation and speaking rate. In 18th International Congress of Phonetic Sciences, Glasgow, UK, submitted.
Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1):26.
Gross, S., Krenn, B., and Scheutz, M. (2017). The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 189–196. ACM.
Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.
Hung, H. and Chittaranjan, G. (2010). The Idiap Wolf corpus: exploring group behaviour in a competitive role-playing game. In Proceedings of the 18th ACM International Conference on Multimedia, pages 879–882. ACM.
Kennington, C., Kousidis, S., and Schlangen, D. (2013). Interpreting situated dialogue utterances: an update model that uses speech, gaze, and gesture information. Proceedings of SIGdial 2013.
Lücking, A., Bergmann, K., Hahn, F., Kopp, S., and Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 Workshop: Multimodal Corpora - Advances in Capturing, Coding and Analyzing Multimodality.
Lücking, A., Pfeiffer, T., and Rieser, H. (2015). Pointing and reference reconsidered. Journal of Pragmatics, 77:56–79.
Meena, R., Skantze, G., and Gustafson, J. (2014). Data-driven models for timing feedback responses in a map task dialogue system. Computer Speech & Language, 28(4):903–922.
Mehlmann, G., Häring, M., Janowski, K., Baur, T., Gebhard, P., and André, E. (2014). Exploring a model of gaze for grounding in multimodal HRI. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 247–254. ACM.
Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S. M., Tyagi, A., Casas, J. R., Turmo, J., Cristoforetti, L., Tobia, F., et al. (2007). The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. Language Resources and Evaluation, 41(3-4):389–407.
Oertel, C., Cummins, F., Edlund, J., Wagner, P., and Campbell, N. (2013). D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 7(1-2):19–28.
Oertel, C., Funes Mora, K. A., Sheikhi, S., Odobez, J.-M., and Gustafson, J. (2014). Who will get the grant? A multimodal corpus for the analysis of conversational behaviours in group interviews. In Proceedings of the 2014 Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, pages 27–32. ACM.
Prasov, Z. and Chai, J. Y. (2008). What's in a gaze? The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 20–29. ACM.
Prasov, Z. and Chai, J. Y. (2010). Fusing eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 471–481. Association for Computational Linguistics.
Rayner, K. (1995). Eye movements and cognitive processes in reading, visual search, and scene perception. In Studies in Visual Information Processing, volume 6, pages 3–22. Elsevier.
Schlangen, D. and Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710–718. Association for Computational Linguistics.
Sheikhi, S. and Odobez, J.-M. (2012). Recognizing the visual focus of attention for human robot interaction. In International Workshop on Human Behavior Understanding, pages 99–112. Springer.
Skantze, G., Johansson, M., and Beskow, J. (2015). Exploring turn-taking cues in multi-party human-robot discussions about objects. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pages 67–74. ACM.
Staudte, M. and Crocker, M. W. (2011). Investigating joint attention mechanisms through spoken human-robot interaction. Cognition, 120(2):268–291.
Stefanov, K. and Beskow, J. (2016). A multi-party multimodal dataset for focus of visual attention in human-human and human-robot interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 of May. ELRA.
Sziklai, G. (1956). Some studies in the speed of visual perception. IRE Transactions on Information Theory, 2(3):125–128.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1):1–191.

  • Introduction
  • Background
    • Multiparty multimodal corpora
    • Social eye-gaze
    • Multimodal reference resolution
      • Data collection
        • Scenario
        • Participants
        • Corpus
        • Experimental setup
        • Motion capture
        • Eye gaze
        • Speech and language
        • Facial expressions
        • Application
        • Subjective measures
        • Calibration
          • Annotations
          • Data processing
            • Data synchronisation
            • Visualisation and visual angle threshold
            • Automatic annotation of gaze data
              • Results
                • Eye gaze during referring expressions
                • Eye gaze before referring expressions
                • Eye gaze timing
                • Mutual gaze and joint attention
                  • Discussion
                  • Conclusions
                  • Acknowledgements
                  • Bibliographical References
Page 3: A Multimodal Corpus for Mutual Gaze and Joint …of-the-art in multimodal multiparty corpora and relevant works in various types social eye gaze. We further describe our process in

of available objects they would need to compromise theirchoices and collectively decide where to place objects

Figure 2 The application running on the large display Theobjects were rdquoboughtrdquo from the furniture shop that had avariety of available furniture

32 ParticipantsWe collected data from 30 participants with a total of 15interactions Our moderator (native US-English speaker)was present on all 15 sessions facilitating the structure ofthe interactions and instructing participants on their role forcompleting the task The moderator was always at the samepart of the table and the two participants were sitting acrossthe moderator (Figure 1) The mean age of our participantswas 257 11 were female and 19 were male and the ma-jority of them were students or researchers at KTH RoyalInstitute of Technology in Stockholm Sweden

33 CorpusOur corpus consists of a total of 15 hours of recordings oftriadic interactions All recordings (roughly one hour each)contain data from various sensors capturing motion eyegaze audio and video streams Each session follows thesame structure The moderator welcomes the participantswith a brief discussion on the topic of moving in a flat withsomeone and thereafter introducing the setup and scenarioof the two participants planning their moving in the sameapartment using the large display During the interactionswe collected a set of multimodal data a variety of inputmodalities that we combined to get information on partici-pantsrsquo decision making and intentionsOut of the 15 sessions 2 sessions had no successful eyegaze calibration and were discarded One session had syn-chronisation issues on the screen application which wasalso discarded Last one session had large gaze data gapsand was discarded as well The rest 11 sessions of theaggregated and processed data from the corpus are fur-ther described on the rest of the paper and are availableat httpswwwkthseprofiledikopagematerial

34 Experimental setupParticipants were situated around a table which had a largetouch display On the display there was an application wedeveloped to facilitate their planning of a moving in to-gether in a flat scenario We gave participants a pair of

gloves with reflective markers and eye tracking glasses (To-bii Glasses 21) which also had reflective markers on to tracktheir position in space The room was surrounded with 17motion capture cameras positioned in such a way that bothgloves and glasses are always on the camerasrsquo sightThe moderator was also wearing eye tracking glassesSince our aim is to develop models for a robot withouthands the moderator was not wearing gloves with mark-ers and was instructed to avoid using hand gestures Therewere two cameras placed on the table capturing facial ex-pressions and a regular video camera at a distance recordingthe full interaction for annotation purposes On the glasseswe also placed lavalier microphones (one per participant)in such a way that we capture the subjectrsquos voice separatedfrom the rest of the subjectsrsquo speech and with a volumeconsistencyThe participants collaborated in the given scenario for 1hour where they discussed and negotiated to form a com-mon solution in apartment planning At the end of therecording we asked participants to fill a questionnaire ontheir perception of how the discussion went their negoti-ations with the other participant and finally some demo-graphic information All participants were reimbursed witha cinema ticket

35 Motion captureWe used an OptiTrack motion capture system2 to collectmotion data from the subjects The 17 motion capturecameras collected motion from reflective markers on 120frames per second To identify rigid objects in the 3d spacewe placed 4 markers per object of interest (glasses glovesdisplay) and captured position (x y z) and rotation (x y zw) for each rigid object While 3 markers are sufficient forcapturing the position of a single rigid body we placed a4th marker on each object for robustness That way if oneof the markers was not captured we would still identify therigid object in space

36 Eye gazeWe were interested in collecting eye gaze data for each par-ticipant in order to model referential gaze joint attentionmutual gaze and gaze aversion in multiparty situated dia-logue (figure 3) In order to capture eye gaze in 3D spacewe used eye tracking glasses with motion capture mark-ers This made it possible to accurately identify when aparticipantrsquos gaze trajectory intersected objects or the otherinterlocutors It also made it possilbe to capture gaze aver-sion ie when participants gazed away when speaking orlistening to one of the participantsGaze samples were on 50Hz and the data was captured bytracking the subjectsrsquo pupil movements and pupil size anda video from their point of reference We placed reflectivemarkers on each pair of glasses and were therefore able toidentify their gaze trajectory in x and y on their point ofreference and then resolve it to world coordinates using theglassesrsquo relevant position in 3d spaceThe glasses using triangulation from both eyesrsquo positionsalso provided a z value which would refer to the point

1httpwwwtobiiprocom2httpoptitrackcom

121

where the trajectories from the two eyes meet That pointin space (x y z) we used to resolve the eye gaze trajectoryin 3d space

Figure 3 By combining synchronised data from motioncapture and eye tracking glasses we captured the partici-pantsrsquo eye gaze in 3d space but also head pose and gestures

37 Speech and languageWe recorded audio from each participant using channel sep-arated lavalier microphones attached to the eye trackingglasses Each microphone captured speech that we laterused to automatically transcribe using speech recognitionand resolve the spoken utterances to text We used a voiceactivity detection filter in all channels to separate capturedspeech from other speakers We further used Tensorflowrsquos3

neural network parser to form syntactic representations ofthe automatically transcribed natural language text

38 Facial expressionsIn order to extract facial expressions we placed GoPro cam-eras in front of the participants and moderator on the tableWe also recorded each session from a camera placed on atripod further in the room to capture the interaction as awhole and for later usage in annotating the corpus

39 ApplicationWe built an application to enable the interaction with sev-eral virtual objects (figure 2) The application consisted oftwo main views a floor plan and a furniture shop The floorplan displayed on the large multi-touch display was used bythe participants during their interactions as the main area tomanipulate the objects while the shops were used to collectnew objectsThe shop screens were divided by room categories such askitchen or living room They naturally induced deictic ex-pressions as the participants referred to the objects duringtheir negotiations using both language and referential gaze(ie rdquoitrdquo the deskrdquo rdquothe bedrdquo) The floor plan on the con-trary was the main display were they could manipulate ob-jects by placing them in rooms and used referring languageto these virtual locations (ie rdquohererdquo rdquothererdquo rdquomy roomrdquo)

3httpswwwtensoroworgversionsr012tutorialssyntaxnet

The floor plan view also displayed the apartment scale andthe budget the participants could spend for objects Thebudget was limited and after initial pilots it was decidedto have a value that would be enough to satisfy both partici-pantsrsquo furniture selections but also limited enough to fosternegotiations on what objects they would select They werealso allowed to only choose one of each item in order tofoster negotiations furtherThe application maintained event-based logging at 5 fps tokeep track of the interaction flow object movement and al-locations Each event carried time stamps and relevant ap-plication state as well as positions and rotation data for allobjects By placing markers in the corners of the screenit could be placed into the motion capture coordinate sys-tem This allowed us to get the virtual object events in 3Dspace and capture the participantsrsquo visual attention to theseobjects

310 Subjective measuresAt the end of each session we gave participants a question-naire to measure their impression on how the discussionwent and how well they thought they had collaborated whendecorating the appartment We used these measures to mea-sure dominance as well as collaborative behaviour and per-sonality All participants were asked to fill personality testsbefore coming to the lab the tests included introversion andextroversion measures (Goldberg 1992)

311 CalibrationThe sensors we used required calibration in order to suc-cessfully capture motion and eye movements We cali-brated all 17 cameras positioning at the beginning of allrecordings while the eye tracking glasses required calibra-tion on each subject separately That is due to the fact thateach participantrsquos eye positions vary but also how their eyesmove during saccades

4 AnnotationsWe annotated referring expressions to objects on the dis-play by looking at what object the speaker intended to referto Speakers typically drew their interlocutorsrsquo attention toobjects using deictic expressions referring language refer-ential gaze as well as mutual gaze For every dialogue therewas a set of references to objects not included in our simu-lated environment -rdquoI also have a fireplace in my flat but Ido not use it a lotrdquo However we only annotated referencesthat can be resolved to objects existing in our simulated en-vironment ie the application on the touch display Thereferences we can resolve in this environment are thereforeonly a subset of all possible references in the dialogueFor the annotations we used the videos of the interactionstogether with the ASR transcriptions and the current stateof the app (visible objects or current screen on the large dis-play) We defined each referring expression by looking atthe time of the speakerrsquos utterance The timing was definedas from the ASR transcriptions that were synchronised withthe gaze and gesture data Utterances were split into in-ter pausal units (IPUs) that were separated by silent pauseslonger than 250ms (Georgeton and Meunier 2015)

122

The ASR transcriptions were used as input to the Tensor-flow syntactic parser where we extracted the part-of-speech(POS) tags Similarly to (Gross et al 2017) as linguisticindicators we used full or elliptic noun-phrases such as rdquothesofardquo rdquosofardquo rdquobedroomrdquo pronouns such as rdquoitrdquo or rdquothisrdquoand rdquothatrdquo possessive pronouns such as rdquomyrdquo rdquominerdquo andspatial indexicals such as rdquohererdquo or rdquothererdquo The ID of thesalient object or location of reference was identified andsaved on the annotations In some cases there was onlyone object referred to by the speaker but there were alsocases where the speaker referred to more than one object inone utterance (ie rdquothe table and the chairsrdquo) The roles ofthe speaker and listeners varied in each expression In totalwe annotated 766 referring expressions out of 5 sessionsroughly 5 hours of recordings which was about onethird ofour corpus

5 Data processingThe variety of modalities was post-processed and analysedwhere in particular combined eye gaze and motion cap-ture provided an accurate estimation of gaze pointing in3D space We used a single-world coordinate system inmeters on both eye gaze and motion capture input streamsand we identified the gaze trajectories head pose and handgestures We started with a low-level multimodal fusionof the input sensor streams and further aggregated the datatogether with speech and language to high-level sequencesof each participantrsquos speech and visual attention Each datapoint had automatic speech transcriptions syntactic parsingof these transcriptions gesture data in the form of movingobjects and eye gaze in the form of target object or person

51 Data synchronisationWe used sound pulses from each device to sync all signalsin a session The motion capture system sent a pulse oneach frame captured while the eye tracking glasses sent 3pulses every 10 seconds The application also sent a pulseevery 10 seconds All sound signals were collected on thesame 12 channel sound card which would then mark thereference point in time for each sessionAudio signals were also captured on the soundcard there-fore we were able to identify a reference point that wouldset the start time for each recording That was the last of thesensors to start which was one of the eye tracking glassesApart from the sound pulses we used a clapperboard withreflective markers on its closed position which would markthe start and the end of each session We used that to syncthe video signals and as a safety point in case one of thesound pulses failed to start Since the application on thedisplay sent sync signals we used it to mark the separationin time of the two experimental conditions The first one(free discussion) ended when the app started which wouldlead to the second condition (task-oriented dialogue)

52 Visualisation and visual angle thresholdAs illustrated in figure 3 we calculated the eye gaze trajec-tories by combining motion capture data and eye trackingdata per frame We implemented a visualisation tool4 in

4Available at httpswwwkthseprofiledikopagematerial

WebGL to verify and evaluate the calculated eye gaze in 3dspace Using this tool we were able to qualitatively inves-tigate the sections of multimodal turn taking behaviour ofparticipants by visualising their position and gaze at objectsor persons along with their transcribed speech and audio inwav format We also used this tool to empirically definethe visual angle threshold for each eye tracker on the accu-racy of gaze targets During pilots we asked participants tolook at different objects ensuring there are angle variationsto identify the threshold of the gaze angle to the dislocationvector (between the glasses and the salient object)

53 Automatic annotation of gaze dataWe extracted eye-gaze data for all participants and all ses-sions and noticed that there were gaps between gaze datapoints quite often and in some sessions more than in othersWe applied smoothing to eliminate outliers and interpolatedgaps in a 12 frame step (100ms on our 120 fps data) Typ-ically an eye fixation is 250ms but no smaller than 100ms(Rayner 1995) which defined our smoothing strategy Thedata was then filtered in the same time window to only pro-vide data points with fixations and not include saccades oreye blinksThere were however gaps that were longer than 12 framescaused by lack of data from the eye trackers In suchcases the eye trackers had no information on the eyesrsquo po-sitions which means that there was no knowledge on if aspeakerlistener looked at the referent object In such casesof no gaze data we went through the manually annotatedreferring expressions and checked the relevant frames ofthe gaze data If at least one of the participants had no gazedata we would discard the relevant referring expressionfrom our analysis After cleaning those cases we had re-maining 582 expressions out of the 766 initially annotated5

The average error of gaze data loss we had was 408 Thesession with the max gaze error rate was 715 while themin was 269During an referring expression participantsrsquo visual atten-tion spanned through a) a variety of visible objects on thescreen b) their interlocutors or c) none of the above whichwe assumed on this corpus to be gaze aversion The promi-nent objects of visual attention were identified by calculat-ing each participantrsquos visual angle α between their gazevector g to the dislocation vector o for every visible objecton the screen for every frame

αij = arccos

(~gij middot ~oij| ~gij | | ~oij |

)(1)

Similarly we approximated each personrsquos head with asphere with radius of 02m and automatically annotated allframes where the gaze vector intersected the sphere as vi-sual attention towards that personFinally after filtering for eye fixations we calculated theproportional gaze likelihood per object

P (oi | ti) =c(oi ti)

c(ti)(2)

5The 5 sessions chosen for annotation were the ones with thesmallest percentage of gaze data loss

123

For each object oi we gathered the proportional gaze datapoints and counted the amount of time the object was gazedduring the time t of an utterance As a gaze target we de-fined the area around any object that is within the thresh-old of the visual angle defined above In many cases morethan one objects competed for saliency in the same gazetarget area We therefore defined as joint attention not thegroup of objects gazed by the interlocutors but the gaze atthe same area around the prominent object given the visualangle threshold

6 Results

In the current section we describe gaze patterns as ob-served between and across the two conditions and propor-tional gaze during and before referring expressions In fig-ure 4 we show the proportional eye-gaze of the moderatoras well as the participantsrsquo eye gaze on the display As canbe observed the participants spent more time gazing on thedisplay than at the moderator or at each other which wasexpected to observe during a task-oriented dialogue

Figure 4 Proportional amount of participant eye-gaze persession on the display during the second condition (task-oriented dialogue)

For each utterance a set of prominent objects was definedfrom the annotations Given the gaze targets per partici-pant we calculated the proportional gaze from the speakerand the listeners during the time of the utterance and ex-actly 1 second before the utterance Since all interactionswere triadic there were two listeners and one speaker at alltimes To compare across all utterances we looked at themean proportional gaze of the two listeners to the area ofthe prominent object to define them as the listener groupWe then compared the gaze of the listener group to thespeakers gaze close to the referent objects We also lookedat the proportional combined gaze to other objects duringthe utterance (all other objects that have been gazed at dur-ing the utterance) gaze at the speaker and averted gaze Infigure 5 the blue colour refers to the proximity area of thereferent object from the speakerrsquos utterance while orangerefers to the gaze at all other known objects combined onthe virtual environment Grey is for the gaze to the listenersor the speaker during the utterance and yellow for avertedgaze

61 Eye gaze during referring expressionsWe looked at all references to objects (N = 582) and com-pared the means of the proportional gaze of the speaker tothe proportional listenersrsquo gaze to the proximity area of thereferent objects the rest of the gazed objects and gaze to-wards each other We conducted paired sample t-tests andfound significant difference between the speaker gaze to theproximity area to the referent object (M = 4669 StdError= 137) against the listener gaze on the area around the ref-erent object (M = 3945 StdError = 111) with [t = 4942p = 0001] There was no significant difference howeveron the speakerrsquos gaze to the listeners against the gaze fromthe listeners to the speaker [t = -0816 p = 0415] (mutualgaze)There was also a significant difference on the proportionalgaze of the speaker to the area around the referent object(M = 4669 StdError = 137) against the proportional gazeto other objects during the same utterances (M = 3241StdError = 125) [t = 5887 p = 0001] However nosignificant difference was found on the same case for thelistenersrsquo gaze [t = -0767 p = 0444]

62 Eye gaze before referring expressionsPrevious studies have revealed that speakers typically lookat referent objects about 800-1000ms before mentioningthem (Staudte and Crocker 2011) We therefore lookedat 1 second before the utterance for all utterances (N =582) and the gaze proportions around the area to the ob-ject(s) that were about the be referred to We compared thespeakerrsquos gaze to the referent objects to the listenersrsquo gazeusing paired sample t-tests The speakerrsquos gaze (M = 4972StdError = 152) was different than the listenerrsquos gaze (M= 2853 StdError = 118) [t = 12928 p = 0001] No sig-nificant difference was found on the mutual gaze during thetime before the referring expression [t = -1421 p = 0156]There was also a significant difference on the proportionalgaze of the speaker to the referent object (M = 4972StdError = 152) towards gaze on other objects (M =3307 StdError = 137) [t = 6154 p = 0001]Finally we compared the proportional gaze of the speakerto the referent object during the utterance and in the 1 sec-ond period before the utterance however there was no sig-nificant difference [t = -1854 p = 0064] There was adifference however as expected on the listeners gaze dur-ing (M = 3945 StdError = 111) and before (M = 285309StdError = 118) the utterance [t = 9745 p = 0001]

Figure 5: Mean proportional gaze per referring expression for speaker and listeners. a) The speaker's gaze during the time of the reference. b) The combined listeners' gaze during the speaker's referring expression. c) The speaker's gaze during the 1 second period before the reference. d) The combined listeners' gaze during the 1 second period before the speaker's referring expression.

6.3. Eye gaze timing
Further, we investigated when each interlocutor turned their gaze to the object that was referred to verbally. In 450 out of the 582 cases, the speaker was already looking at the object in a time window of 1 second prior to their verbal reference to it. On average, during this 1 s window, they turned their gaze to the object 0.796 s before uttering the referring expression [t = 83.157, p = 0.001], which is in line with the literature.
At least one of the listeners looked at the referent objects during the speaker's utterance in 537 out of the 582 cases. On average, they started looking at the proximity area of the referent object 344.5 ms after the speaker started uttering a referring expression [t = 19.355, p = 0.001].
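Such onset times could, for instance, be derived from frame-level gaze data as in the sketch below; the 120 fps layout and function signature are assumptions, and the actual analysis may differ.

```python
# Sketch (assumed frame-level data): offset of the first fixation on the referent
# area relative to the onset of the referring expression.
# Negative values mean the gaze arrived before the expression started.
def first_fixation_offset(on_target, utt_start, utt_end, lookback_s=1.0, fps=120):
    """on_target: per-frame booleans, True when gaze falls within the referent's
    proximity area. Search from 1 s before the utterance to its end."""
    start = max(utt_start - int(lookback_s * fps), 0)
    for frame in range(start, utt_end):
        if on_target[frame]:
            return (frame - utt_start) / fps  # seconds relative to utterance onset
    return None  # never fixated the referent area in the window
```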

6.4. Mutual gaze and joint attention
We extracted all occasions where the group was jointly attending the area around the referent object, but also the occasions where only the speaker, only the listeners, or neither of them looked at the object (table 1).

N = 582           BU [-1, 0]   DU [0, t]
None              58           21
Both              339          479
Speaker only      111          24
Listeners only    74           58

Table 1: Joint attention to the referent object area before (BU) and during (DU) the speaker's references.
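A possible way to derive the categories in tables 1 and 2 from the proportional gaze values is sketched below; the threshold and inputs are assumptions, not the authors' implementation.

```python
# Sketch of the joint-attention categorisation: a participant counts as attending
# the referent area if their proportional gaze to it exceeds a minimum share.
def attention_category(speaker_share, listener_shares, min_share=0.0):
    speaker_on = speaker_share > min_share
    listeners_on = any(share > min_share for share in listener_shares)
    if speaker_on and listeners_on:
        return "Both"
    if speaker_on:
        return "Speaker only"
    if listeners_on:
        return "Listeners only"
    return "None"

# e.g. attention_category(0.47, [0.39, 0.12]) -> "Both"
```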

During the preliminary analysis of the corpus, we noticed that in many cases at least one of the listeners was already looking at the area around the referent object before the speaker would utter a referring expression (figure 5). It was our intuition that the salient object was already in the group's attention beforehand. We looked at 1 s before the utterance and automatically extracted these cases, as can be seen in table 2.

N = 582           BU [-1 s]
None              163
Both              168
Speaker only      155
Listeners only    96

Table 2: Count of occasions where the referent object had already been in the focus of attention (1 s before the speaker's referring expression).

Finally, we looked at the occasions where the interlocutors established mutual gaze. Table 3 shows the cases where the speaker looked at one of the listeners, or where at least one of the listeners looked at the speaker, during and before referring expressions. Mutual gaze indicates that the speaker and one of the listeners looked at each other at the same time and, as can be seen, this is very rare during or before referring expressions.

N = 582                BU [-1, 0]   DU [0, t]
Mutual gaze            10           20
Speaker at Listeners   52           72
Listeners at Speaker   80           143

Table 3: Count of occasions where the speaker looked at the listeners, the listeners looked at the speaker, and where they established mutual gaze, during and before the speaker's referring expressions.
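Mutual gaze events of this kind can be detected from the automatically annotated gaze targets, for example as in the following sketch; the data layout, in which a frame's target label equals the id of the gazed-at person, is an assumption.

```python
# Sketch (assumed frame-level annotation): detect frames in which two
# participants gaze at each other, i.e. each one's target label is the other's id.
def mutual_gaze_frames(targets_a, targets_b, id_a, id_b):
    """targets_a, targets_b: per-frame gaze-target labels for the two participants."""
    return [i for i, (ta, tb) in enumerate(zip(targets_a, targets_b))
            if ta == id_b and tb == id_a]

# An utterance window would count as "Mutual gaze" in table 3 if this list is
# non-empty for the speaker and at least one of the listeners within the window.
```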

7. Discussion
The presented corpus contains recordings of a variety of multimodal data, is processed and annotated, and provides researchers with the possibility to explore multimodal, multiparty turn-taking behaviours in situated interaction. While its strength lies in the rich annotation and the variation in conversational dynamics, it also has some limitations.
One limitation is the granularity of the eye-gaze annotation. Even though a high-end eye-gaze tracking system was used, it was not always possible to disambiguate which of the potential objects was being gazed at. Given a visual angle threshold, we obtain a confidence measure, defined by the angle difference to each object, for identifying the objects of attention. This limited our analysis to the area around the object rather than a precise measure of the object itself. Another limitation is that we did not analyse the object movements or the participants' pointing gestures. These might explain some of the visual attention cases prior to the verbal referring expressions.
Moreover, the eye tracking glasses and motion capture equipment had the disadvantage of being quite intrusive. Participants complained about fatigue at the end of the interactions, and it was also qualitatively observed that their gaze-head movement coordination changed once wearing the glasses.
On most occasions the listeners and the speakers were looking at the area of the referent object before the referring expression was uttered, which could potentially mean that the object was already in the group's attention. In some cases the listeners' visual attention was brought to the object area by referring language. Similarly to the cited literature, this potentially shows that gaze gives indications of the salient objects in a group's discussion and can be used for reference resolution. It is also our intuition that objects that attract higher fixation density during referring expressions are considered more salient and can potentially resolve the references.
There are a few cases where neither the speaker nor the listeners looked at the referent objects; in such cases, text saliency algorithms (Evangelopoulos et al., 2009) or other multimodal cues, such as pointing gestures (Lucking et al., 2015), could be combined to resolve the reference to the salient objects.
Typically, speakers direct the listeners' attention using verbal and non-verbal cues, and listeners often read the speaker's visual attention during referring expressions to get an indication of the referent objects. In line with the literature, we found that speakers and listeners gazed at each other during the references to establish grounding on the referent objects. In very few cases, however, did they also establish mutual gaze (looking at each other at the same time) during those references.

8. Conclusions
The current paper presents a corpus of multiparty situated interaction. It is fully transcribed and automatically annotated for eye gaze, gestures and spoken language. Moreover, it features an automatic eye-gaze annotation method where the participants' gaze is resolved in 3D space; a visualisation tool is also provided to qualitatively examine parts of the corpus, or the whole corpus, in terms of conversational dynamics, turn-taking and reference resolution. We annotated object references and investigated the proportional gaze from both the perspective of the speaker and the listeners. Finally, we quantitatively described the data of the corpus and gave further indications on how the corpus can be useful to the research community. Both the annotated corpus and the visualisations are available at https://www.kth.se/profile/diko/page/material.

9. Acknowledgements
The authors would like to thank Joseph Mendelson for moderating the sessions and helping with the recordings of this corpus. We would also like to acknowledge the support from the EU Horizon 2020 project BabyRobot (687831) and the Swedish Foundation for Strategic Research project FACT (GMT14-0082).

10. Bibliographical References

Admoni, H. and Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction, 6(1):25-63.

Bohus, D. and Horvitz, E. (2009). Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225-234. Association for Computational Linguistics.

Bohus, D. and Horvitz, E. (2010). Facilitating multiparty dialog with gaze, gesture, and speech. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, page 5. ACM.

Bolt, R. A. (1980). 'Put-that-there': Voice and gesture at the graphics interface, volume 14. ACM.

Borji, A. and Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185-207.

Bruce, N. D. and Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5-5.

Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation, 41(2):181-190.

Evangelopoulos, G., Zlatintsi, A., Skoumas, G., Rapantzikos, K., Potamianos, A., Maragos, P., and Avrithis, Y. (2009). Video event detection and summarization using audio, visual and text saliency. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 3553-3556. IEEE.

Georgeton, L. and Meunier, C. (2015). Spontaneous speech production by dysarthric and healthy speakers: Temporal organisation and speaking rate. In 18th International Congress of Phonetic Sciences, Glasgow, UK, submitted.

Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1):26.

Gross, S., Krenn, B., and Scheutz, M. (2017). The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 189-196. ACM.

Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204.

Hung, H. and Chittaranjan, G. (2010). The IDIAP Wolf corpus: exploring group behaviour in a competitive role-playing game. In Proceedings of the 18th ACM International Conference on Multimedia, pages 879-882. ACM.

Kennington, C., Kousidis, S., and Schlangen, D. (2013). Interpreting situated dialogue utterances: an update model that uses speech, gaze, and gesture information. In Proceedings of SIGdial 2013.

Lucking, A., Bergmann, K., Hahn, F., Kopp, S., and Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 Workshop: Multimodal Corpora - Advances in Capturing, Coding and Analyzing Multimodality.

Lucking, A., Pfeiffer, T., and Rieser, H. (2015). Pointing and reference reconsidered. Journal of Pragmatics, 77:56-79.

Meena, R., Skantze, G., and Gustafson, J. (2014). Data-driven models for timing feedback responses in a map task dialogue system. Computer Speech & Language, 28(4):903-922.

Mehlmann, G., Haring, M., Janowski, K., Baur, T., Gebhard, P., and Andre, E. (2014). Exploring a model of gaze for grounding in multimodal HRI. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 247-254. ACM.

Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S. M., Tyagi, A., Casas, J. R., Turmo, J., Cristoforetti, L., Tobia, F., et al. (2007). The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. Language Resources and Evaluation, 41(3-4):389-407.

Oertel, C., Cummins, F., Edlund, J., Wagner, P., and Campbell, N. (2013). D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 7(1-2):19-28.

Oertel, C., Funes Mora, K. A., Sheikhi, S., Odobez, J.-M., and Gustafson, J. (2014). Who will get the grant? A multimodal corpus for the analysis of conversational behaviours in group interviews. In Proceedings of the 2014 Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, pages 27-32. ACM.

Prasov, Z. and Chai, J. Y. (2008). What's in a gaze? The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 20-29. ACM.

Prasov, Z. and Chai, J. Y. (2010). Fusing eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 471-481. Association for Computational Linguistics.

Rayner, K. (1995). Eye movements and cognitive processes in reading, visual search, and scene perception. In Studies in Visual Information Processing, volume 6, pages 3-22. Elsevier.

Schlangen, D. and Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710-718. Association for Computational Linguistics.

Sheikhi, S. and Odobez, J.-M. (2012). Recognizing the visual focus of attention for human robot interaction. In International Workshop on Human Behavior Understanding, pages 99-112. Springer.

Skantze, G., Johansson, M., and Beskow, J. (2015). Exploring turn-taking cues in multi-party human-robot discussions about objects. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pages 67-74. ACM.

Staudte, M. and Crocker, M. W. (2011). Investigating joint attention mechanisms through spoken human-robot interaction. Cognition, 120(2):268-291.

Stefanov, K. and Beskow, J. (2016). A multi-party multi-modal dataset for focus of visual attention in human-human and human-robot interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 of May. ELRA.

Sziklai, G. (1956). Some studies in the speed of visual perception. IRE Transactions on Information Theory, 2(3):125-128.

Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1):1-191.

For each utterance a set of prominent objects was definedfrom the annotations Given the gaze targets per partici-pant we calculated the proportional gaze from the speakerand the listeners during the time of the utterance and ex-actly 1 second before the utterance Since all interactionswere triadic there were two listeners and one speaker at alltimes To compare across all utterances we looked at themean proportional gaze of the two listeners to the area ofthe prominent object to define them as the listener groupWe then compared the gaze of the listener group to thespeakers gaze close to the referent objects We also lookedat the proportional combined gaze to other objects duringthe utterance (all other objects that have been gazed at dur-ing the utterance) gaze at the speaker and averted gaze Infigure 5 the blue colour refers to the proximity area of thereferent object from the speakerrsquos utterance while orangerefers to the gaze at all other known objects combined onthe virtual environment Grey is for the gaze to the listenersor the speaker during the utterance and yellow for avertedgaze

61 Eye gaze during referring expressionsWe looked at all references to objects (N = 582) and com-pared the means of the proportional gaze of the speaker tothe proportional listenersrsquo gaze to the proximity area of thereferent objects the rest of the gazed objects and gaze to-wards each other We conducted paired sample t-tests andfound significant difference between the speaker gaze to theproximity area to the referent object (M = 4669 StdError= 137) against the listener gaze on the area around the ref-erent object (M = 3945 StdError = 111) with [t = 4942p = 0001] There was no significant difference howeveron the speakerrsquos gaze to the listeners against the gaze fromthe listeners to the speaker [t = -0816 p = 0415] (mutualgaze)There was also a significant difference on the proportionalgaze of the speaker to the area around the referent object(M = 4669 StdError = 137) against the proportional gazeto other objects during the same utterances (M = 3241StdError = 125) [t = 5887 p = 0001] However nosignificant difference was found on the same case for thelistenersrsquo gaze [t = -0767 p = 0444]

62 Eye gaze before referring expressionsPrevious studies have revealed that speakers typically lookat referent objects about 800-1000ms before mentioningthem (Staudte and Crocker 2011) We therefore lookedat 1 second before the utterance for all utterances (N =582) and the gaze proportions around the area to the ob-ject(s) that were about the be referred to We compared thespeakerrsquos gaze to the referent objects to the listenersrsquo gazeusing paired sample t-tests The speakerrsquos gaze (M = 4972StdError = 152) was different than the listenerrsquos gaze (M= 2853 StdError = 118) [t = 12928 p = 0001] No sig-nificant difference was found on the mutual gaze during thetime before the referring expression [t = -1421 p = 0156]There was also a significant difference on the proportionalgaze of the speaker to the referent object (M = 4972StdError = 152) towards gaze on other objects (M =3307 StdError = 137) [t = 6154 p = 0001]Finally we compared the proportional gaze of the speakerto the referent object during the utterance and in the 1 sec-ond period before the utterance however there was no sig-nificant difference [t = -1854 p = 0064] There was adifference however as expected on the listeners gaze dur-ing (M = 3945 StdError = 111) and before (M = 285309StdError = 118) the utterance [t = 9745 p = 0001]

63 Eye gaze timingFurther we investigated when each interlocutor turned theirgaze to the object that was referred to verbally In 450 out ofthe 582 cases the speaker was already looking at the objectin a time window of 1 second prior to their verbal referenceto it On average during the 1s window they turned theirgaze to the object 0796s before uttering the referring ex-pression [t = 83157 p = 0001] which is supported by theliteratureAt least one of the listeners looked at the referent objectsduring the speakerrsquos utterance in 537 out of the 582 casesOn average they started looking at the proximity area of the

124

Figure 5 Mean proportional gaze per referring expressionfor speaker and listeners a) The speakerrsquos gaze during thetime of the reference b) The combined listenersrsquo gaze dur-ing the speakerrsquos referring expression c) The speakerrsquosgaze during the 1 second period before the reference d)The combined listenersrsquo gaze during the 1 second periodbefore the speakerrsquos referring expression

referent object 3445ms after the speaker started uttering areferring expression [t = 19355 p = 0001]

64 Mutual gaze and joint attentionWe extracted all occasions where the group was jointly at-tending the area around the referent object but also the oc-casions where either the speaker or the none of the listenerslooked at the object table 1

N = 582 BU [-10] DU [0t]None 58 21Both 339 479

Speaker only 111 24Listeners only 74 58

Table 1 Joint attention to the referent object area before(BU) and during (DU) speakerrsquos references

During the preliminary analysis of the corpus we noticedthat in many cases at least one of the listeners was alreadylooking at the area around the referent object before thespeaker would utter a referring expression (figure 5) Itwas our intuition that the salient object was already in thegrouprsquos attention before We looked at -1s before the ut-terance and automatically extracted these cases as can beseen on table 2

N = 582 BU [-1s]None 163Both 168

Speaker only 155Listeners only 96

Table 2 Count of occasions where the referent object hasalready been in focus of attention (-1s before speakerrsquos re-ferring expression)

Finally we looked at the occasions where the interlocu-tors established mutual gaze The table below shows caseswhere the speaker looked at one of the listeners or where atleast one of the listeners looked at the speaker during andbefore referring expressions Mutual gaze indicates wherethe speaker and one of the listeners look at each other andas can be seen this is very rare during or before referringexpressions

N = 582 BU [-10] DU [0t]Mutual gaze 10 20

Speaker at Listeners 52 72Listeners at Speaker 80 143

Table 3 Count of occasions where the speaker looked atthe listeners the listeners looked at the speaker and mutualgaze during and before the speakerrsquos referring expressions

7 DiscussionThe presented corpus contains recordings of a variety ofmultimodal data is processed and annotated and providesresearchers with the possibility to explore multi-modalmulti-party turn-taking behaviours in situated interactionWhile its strength lies in the rich annotation and the vari-ation in conversational dynamics it also has some limita-tionsOne of the limitations is the granularity of eye-gaze anno-tation Even though a high-end eye-gaze tracking systemwas used it was not always possible to disambiguate whichof the potential objects were being gazed at Given a visualangle threshold we are provided with a certain confidencemeasure defined by the angle difference to each object to-wards identifying the objects of attention That limited ouranalysis to the area around the object rather than a precisemeasure of the object itself Another limitation is that wedid not analyse the object movements or the participantsrsquopointing gestures These might explain some of the visualattention cases prior to the verbal referring expressionsMoreover the use of the eye tracking glasses and mo-tion capture equipment had the disadvantage that theywere quite intrusive Participants complained about fa-tigue at the end of the interactions and it was also quali-tatively observed that their gaze-head movement coordina-tion changed once wearing the glassesIn most occasions the listeners and the speakers were look-ing at the area of the referent object before the referringexpression was uttered which could potentially mean thatthe object was already on the grouprsquos attention In somecases the listenersrsquo visual attention was brought to the ob-ject area by referring language Similarly to the referredliterature this potentially shows that gaze has indicationson the salient objects in a grouprsquos discussion and that canbe used for reference resolution It is also our intuition thatobjects that establish higher fixation density during refer-ring expressions are considered to be more salient and canpotentially resolve the referencesThere are a few cases where neither the speaker nor thelistener looked at the referent objects in such cases textsaliency algorithms (Evangelopoulos et al 2009) or other

125

multimodal cues such as pointing gestures (Lucking et al2015) could be combined to resolve the reference to thesalient objectsTypically speakers direct the listenersrsquo attention usingverbal and non-verbal cues and listeners often read thespeakerrsquos visual attention during referring expressions toget indication on the referent objects As in literature wefound that speakers and listeners gazed at each other a lotduring the references to establish grounding on the referentobjects In very few cases however they also establishedmutual gaze (looking at each other at the same time) duringthose references

8 ConclusionsThe current paper presents a corpus of multi-party situatedinteraction It is fully transcribed and automatically anno-tated for eye-gaze gestures and spoken language More-over it features an automatic eye-gaze annotation methodwhere the participantrsquos gaze is resolved in 3d space a visu-alisation tool is also used to qualitatively examine parts orthe whole corpus in terms of conversational dynamics turntaking and reference resolution We annotated object ref-erences and investigated the proportional gaze from boththe perspective of the speaker and the listeners Finallywe quantitatively described the data of the corpus and gavefurther indications on how the corpus can be useful by theresearch community Both the annotated corpus and vi-sualisations are available at httpswwwkthseprofiledikopagematerial

9 AcknowledgementsThe authors would like to thank Joseph Mendelson formoderating the sessions and helping with the recordings ofthis corpus We would also like to acknowledge the supportfrom the EU Horizon 2020 project BabyRobot (687831)and the Swedish Foundation for Strategic Research projectFACT (GMT14-0082)

10 Bibliographical ReferencesAdmoni H and Scassellati B (2017) Social eye gaze in

human-robot interaction a review Journal of Human-Robot Interaction 6(1)25ndash63

Bohus D and Horvitz E (2009) Models for multipartyengagement in open-world dialog In Proceedings of theSIGDIAL 2009 Conference The 10th Annual Meeting ofthe Special Interest Group on Discourse and Dialoguepages 225ndash234 Association for Computational Linguis-tics

Bohus D and Horvitz E (2010) Facilitating multi-party dialog with gaze gesture and speech In Inter-national Conference on Multimodal Interfaces and theWorkshop on Machine Learning for Multimodal Interac-tion page 5 ACM

Bolt R A (1980) rsquoPut-that-therersquo Voice and gesture atthe graphics interface volume 14 ACM

Borji A and Itti L (2013) State-of-the-art in visual at-tention modeling IEEE transactions on pattern analysisand machine intelligence 35(1)185ndash207

Bruce N D and Tsotsos J K (2009) Saliency attentionand visual search An information theoretic approachJournal of vision 9(3)5ndash5

Carletta J (2007) Unleashing the killer corpus experi-ences in creating the multi-everything ami meeting cor-pus Language Resources and Evaluation 41(2)181ndash190

Evangelopoulos G Zlatintsi A Skoumas G Ra-pantzikos K Potamianos A Maragos P and AvrithisY (2009) Video event detection and summarization us-ing audio visual and text saliency In Acoustics Speechand Signal Processing 2009 ICASSP 2009 IEEE Inter-national Conference on pages 3553ndash3556 IEEE

Georgeton L and Meunier C (2015) Spontaneousspeech production by dysarthric and healthy speakersTemporal organisation and speaking rate In 18th Inter-national Congress of Phonetic Sciences Glasgow UKsubmitted

Goldberg L R (1992) The development of markers forthe big-five factor structure Psychological assessment4(1)26

Gross S Krenn B and Scheutz M (2017) The re-liability of non-verbal cues for situated reference res-olution and their interplay with language implicationsfor human robot interaction In Proceedings of the 19thACM International Conference on Multimodal Interac-tion pages 189ndash196 ACM

Grosz B J and Sidner C L (1986) Attention intentionsand the structure of discourse Computational linguis-tics 12(3)175ndash204

Hung H and Chittaranjan G (2010) The idiap wolf cor-pus exploring group behaviour in a competitive role-playing game In Proceedings of the 18th ACM interna-tional conference on Multimedia pages 879ndash882 ACM

Kennington C Kousidis S and Schlangen D (2013)Interpreting situated dialogue utterances an updatemodel that uses speech gaze and gesture informationProceedings of SIGdial 2013

Lucking A Bergmann K Hahn F Kopp S and RieserH (2010) The bielefeld speech and gesture alignmentcorpus (saga) In LREC 2010 workshop Multimodalcorporandashadvances in capturing coding and analyzingmultimodality

Lucking A Pfeiffer T and Rieser H (2015) Point-ing and reference reconsidered Journal of Pragmatics7756ndash79

Meena R Skantze G and Gustafson J (2014) Data-driven models for timing feedback responses in a maptask dialogue system Computer Speech amp Language28(4)903ndash922

Mehlmann G Haring M Janowski K Baur T Geb-hard P and Andre E (2014) Exploring a model ofgaze for grounding in multimodal hri In Proceedings ofthe 16th International Conference on Multimodal Inter-action pages 247ndash254 ACM

Mostefa D Moreau N Choukri K Potamianos GChu S M Tyagi A Casas J R Turmo J Cristofore-tti L Tobia F et al (2007) The chil audiovisual cor-

126

pus for lecture and meeting analysis inside smart roomsLanguage Resources and Evaluation 41(3-4)389ndash407

Oertel C Cummins F Edlund J Wagner P and Camp-bell N (2013) D64 A corpus of richly recorded con-versational interaction Journal on Multimodal User In-terfaces 7(1-2)19ndash28

Oertel C Funes Mora K A Sheikhi S Odobez J-Mand Gustafson J (2014) Who will get the grant Amultimodal corpus for the analysis of conversational be-haviours in group interviews In Proceedings of the 2014Workshop on Understanding and Modeling MultipartyMultimodal Interactions pages 27ndash32 ACM

Prasov Z and Chai J Y (2008) Whatrsquos in a gaze therole of eye-gaze in reference resolution in multimodalconversational interfaces In Proceedings of the 13thinternational conference on Intelligent user interfacespages 20ndash29 ACM

Prasov Z and Chai J Y (2010) Fusing eye gaze withspeech recognition hypotheses to resolve exophoric ref-erences in situated dialogue In Proceedings of the 2010Conference on Empirical Methods in Natural LanguageProcessing pages 471ndash481 Association for Computa-tional Linguistics

Rayner K (1995) Eye movements and cognitive pro-cesses in reading visual search and scene perceptionIn Studies in visual information processing volume 6pages 3ndash22 Elsevier

Schlangen D and Skantze G (2009) A general abstractmodel of incremental dialogue processing In Proceed-ings of the 12th Conference of the European Chapterof the Association for Computational Linguistics pages710ndash718 Association for Computational Linguistics

Sheikhi S and Odobez J-M (2012) Recognizing the vi-sual focus of attention for human robot interaction In In-ternational Workshop on Human Behavior Understand-ing pages 99ndash112 Springer

Skantze G Johansson M and Beskow J (2015) Ex-ploring turn-taking cues in multi-party human-robot dis-cussions about objects In Proceedings of the 2015 ACMon International Conference on Multimodal Interactionpages 67ndash74 ACM

Staudte M and Crocker M W (2011) Investigating jointattention mechanisms through spoken humanndashrobot in-teraction Cognition 120(2)268ndash291

Stefanov K and Beskow J (2016) A multi-party multi-modal dataset for focus of visual attention in human-human and human-robot interaction In Proceedings ofthe 10th edition of the Language Resources and Evalua-tion Conference (LREC 2016 23-28 of May ELRA

Sziklai G (1956) Some studies in the speed of visualperception IRE Transactions on Information Theory2(3)125ndash128

Winograd T (1972) Understanding natural languageCognitive psychology 3(1)1ndash191

127

  • Introduction
  • Background
    • Multiparty multimodal corpora
    • Social eye-gaze
    • Multimodal reference resolution
      • Data collection
        • Scenario
        • Participants
        • Corpus
        • Experimental setup
        • Motion capture
        • Eye gaze
        • Speech and language
        • Facial expressions
        • Application
        • Subjective measures
        • Calibration
          • Annotations
          • Data processing
            • Data synchronisation
            • Visualisation and visual angle threshold
            • Automatic annotation of gaze data
              • Results
                • Eye gaze during referring expressions
                • Eye gaze before referring expressions
                • Eye gaze timing
                • Mutual gaze and joint attention
                  • Discussion
                  • Conclusions
                  • Acknowledgements
                  • Bibliographical References
Page 5: A Multimodal Corpus for Mutual Gaze and Joint …of-the-art in multimodal multiparty corpora and relevant works in various types social eye gaze. We further describe our process in

The ASR transcriptions were used as input to the Tensor-flow syntactic parser where we extracted the part-of-speech(POS) tags Similarly to (Gross et al 2017) as linguisticindicators we used full or elliptic noun-phrases such as rdquothesofardquo rdquosofardquo rdquobedroomrdquo pronouns such as rdquoitrdquo or rdquothisrdquoand rdquothatrdquo possessive pronouns such as rdquomyrdquo rdquominerdquo andspatial indexicals such as rdquohererdquo or rdquothererdquo The ID of thesalient object or location of reference was identified andsaved on the annotations In some cases there was onlyone object referred to by the speaker but there were alsocases where the speaker referred to more than one object inone utterance (ie rdquothe table and the chairsrdquo) The roles ofthe speaker and listeners varied in each expression In totalwe annotated 766 referring expressions out of 5 sessionsroughly 5 hours of recordings which was about onethird ofour corpus

5 Data processingThe variety of modalities was post-processed and analysedwhere in particular combined eye gaze and motion cap-ture provided an accurate estimation of gaze pointing in3D space We used a single-world coordinate system inmeters on both eye gaze and motion capture input streamsand we identified the gaze trajectories head pose and handgestures We started with a low-level multimodal fusionof the input sensor streams and further aggregated the datatogether with speech and language to high-level sequencesof each participantrsquos speech and visual attention Each datapoint had automatic speech transcriptions syntactic parsingof these transcriptions gesture data in the form of movingobjects and eye gaze in the form of target object or person

51 Data synchronisationWe used sound pulses from each device to sync all signalsin a session The motion capture system sent a pulse oneach frame captured while the eye tracking glasses sent 3pulses every 10 seconds The application also sent a pulseevery 10 seconds All sound signals were collected on thesame 12 channel sound card which would then mark thereference point in time for each sessionAudio signals were also captured on the soundcard there-fore we were able to identify a reference point that wouldset the start time for each recording That was the last of thesensors to start which was one of the eye tracking glassesApart from the sound pulses we used a clapperboard withreflective markers on its closed position which would markthe start and the end of each session We used that to syncthe video signals and as a safety point in case one of thesound pulses failed to start Since the application on thedisplay sent sync signals we used it to mark the separationin time of the two experimental conditions The first one(free discussion) ended when the app started which wouldlead to the second condition (task-oriented dialogue)

52 Visualisation and visual angle thresholdAs illustrated in figure 3 we calculated the eye gaze trajec-tories by combining motion capture data and eye trackingdata per frame We implemented a visualisation tool4 in

4Available at httpswwwkthseprofiledikopagematerial

WebGL to verify and evaluate the calculated eye gaze in 3dspace Using this tool we were able to qualitatively inves-tigate the sections of multimodal turn taking behaviour ofparticipants by visualising their position and gaze at objectsor persons along with their transcribed speech and audio inwav format We also used this tool to empirically definethe visual angle threshold for each eye tracker on the accu-racy of gaze targets During pilots we asked participants tolook at different objects ensuring there are angle variationsto identify the threshold of the gaze angle to the dislocationvector (between the glasses and the salient object)

53 Automatic annotation of gaze dataWe extracted eye-gaze data for all participants and all ses-sions and noticed that there were gaps between gaze datapoints quite often and in some sessions more than in othersWe applied smoothing to eliminate outliers and interpolatedgaps in a 12 frame step (100ms on our 120 fps data) Typ-ically an eye fixation is 250ms but no smaller than 100ms(Rayner 1995) which defined our smoothing strategy Thedata was then filtered in the same time window to only pro-vide data points with fixations and not include saccades oreye blinksThere were however gaps that were longer than 12 framescaused by lack of data from the eye trackers In suchcases the eye trackers had no information on the eyesrsquo po-sitions which means that there was no knowledge on if aspeakerlistener looked at the referent object In such casesof no gaze data we went through the manually annotatedreferring expressions and checked the relevant frames ofthe gaze data If at least one of the participants had no gazedata we would discard the relevant referring expressionfrom our analysis After cleaning those cases we had re-maining 582 expressions out of the 766 initially annotated5

The average error of gaze data loss we had was 408 Thesession with the max gaze error rate was 715 while themin was 269During an referring expression participantsrsquo visual atten-tion spanned through a) a variety of visible objects on thescreen b) their interlocutors or c) none of the above whichwe assumed on this corpus to be gaze aversion The promi-nent objects of visual attention were identified by calculat-ing each participantrsquos visual angle α between their gazevector g to the dislocation vector o for every visible objecton the screen for every frame

αij = arccos

(~gij middot ~oij| ~gij | | ~oij |

)(1)

Similarly we approximated each personrsquos head with asphere with radius of 02m and automatically annotated allframes where the gaze vector intersected the sphere as vi-sual attention towards that personFinally after filtering for eye fixations we calculated theproportional gaze likelihood per object

P (oi | ti) =c(oi ti)

c(ti)(2)

5The 5 sessions chosen for annotation were the ones with thesmallest percentage of gaze data loss

123

For each object oi we gathered the proportional gaze datapoints and counted the amount of time the object was gazedduring the time t of an utterance As a gaze target we de-fined the area around any object that is within the thresh-old of the visual angle defined above In many cases morethan one objects competed for saliency in the same gazetarget area We therefore defined as joint attention not thegroup of objects gazed by the interlocutors but the gaze atthe same area around the prominent object given the visualangle threshold

6 Results

In the current section we describe gaze patterns as ob-served between and across the two conditions and propor-tional gaze during and before referring expressions In fig-ure 4 we show the proportional eye-gaze of the moderatoras well as the participantsrsquo eye gaze on the display As canbe observed the participants spent more time gazing on thedisplay than at the moderator or at each other which wasexpected to observe during a task-oriented dialogue

Figure 4 Proportional amount of participant eye-gaze persession on the display during the second condition (task-oriented dialogue)

For each utterance a set of prominent objects was definedfrom the annotations Given the gaze targets per partici-pant we calculated the proportional gaze from the speakerand the listeners during the time of the utterance and ex-actly 1 second before the utterance Since all interactionswere triadic there were two listeners and one speaker at alltimes To compare across all utterances we looked at themean proportional gaze of the two listeners to the area ofthe prominent object to define them as the listener groupWe then compared the gaze of the listener group to thespeakers gaze close to the referent objects We also lookedat the proportional combined gaze to other objects duringthe utterance (all other objects that have been gazed at dur-ing the utterance) gaze at the speaker and averted gaze Infigure 5 the blue colour refers to the proximity area of thereferent object from the speakerrsquos utterance while orangerefers to the gaze at all other known objects combined onthe virtual environment Grey is for the gaze to the listenersor the speaker during the utterance and yellow for avertedgaze

61 Eye gaze during referring expressionsWe looked at all references to objects (N = 582) and com-pared the means of the proportional gaze of the speaker tothe proportional listenersrsquo gaze to the proximity area of thereferent objects the rest of the gazed objects and gaze to-wards each other We conducted paired sample t-tests andfound significant difference between the speaker gaze to theproximity area to the referent object (M = 4669 StdError= 137) against the listener gaze on the area around the ref-erent object (M = 3945 StdError = 111) with [t = 4942p = 0001] There was no significant difference howeveron the speakerrsquos gaze to the listeners against the gaze fromthe listeners to the speaker [t = -0816 p = 0415] (mutualgaze)There was also a significant difference on the proportionalgaze of the speaker to the area around the referent object(M = 4669 StdError = 137) against the proportional gazeto other objects during the same utterances (M = 3241StdError = 125) [t = 5887 p = 0001] However nosignificant difference was found on the same case for thelistenersrsquo gaze [t = -0767 p = 0444]

62 Eye gaze before referring expressionsPrevious studies have revealed that speakers typically lookat referent objects about 800-1000ms before mentioningthem (Staudte and Crocker 2011) We therefore lookedat 1 second before the utterance for all utterances (N =582) and the gaze proportions around the area to the ob-ject(s) that were about the be referred to We compared thespeakerrsquos gaze to the referent objects to the listenersrsquo gazeusing paired sample t-tests The speakerrsquos gaze (M = 4972StdError = 152) was different than the listenerrsquos gaze (M= 2853 StdError = 118) [t = 12928 p = 0001] No sig-nificant difference was found on the mutual gaze during thetime before the referring expression [t = -1421 p = 0156]There was also a significant difference on the proportionalgaze of the speaker to the referent object (M = 4972StdError = 152) towards gaze on other objects (M =3307 StdError = 137) [t = 6154 p = 0001]Finally we compared the proportional gaze of the speakerto the referent object during the utterance and in the 1 sec-ond period before the utterance however there was no sig-nificant difference [t = -1854 p = 0064] There was adifference however as expected on the listeners gaze dur-ing (M = 3945 StdError = 111) and before (M = 285309StdError = 118) the utterance [t = 9745 p = 0001]

63 Eye gaze timingFurther we investigated when each interlocutor turned theirgaze to the object that was referred to verbally In 450 out ofthe 582 cases the speaker was already looking at the objectin a time window of 1 second prior to their verbal referenceto it On average during the 1s window they turned theirgaze to the object 0796s before uttering the referring ex-pression [t = 83157 p = 0001] which is supported by theliteratureAt least one of the listeners looked at the referent objectsduring the speakerrsquos utterance in 537 out of the 582 casesOn average they started looking at the proximity area of the

124

Figure 5 Mean proportional gaze per referring expressionfor speaker and listeners a) The speakerrsquos gaze during thetime of the reference b) The combined listenersrsquo gaze dur-ing the speakerrsquos referring expression c) The speakerrsquosgaze during the 1 second period before the reference d)The combined listenersrsquo gaze during the 1 second periodbefore the speakerrsquos referring expression

referent object 3445ms after the speaker started uttering areferring expression [t = 19355 p = 0001]

64 Mutual gaze and joint attentionWe extracted all occasions where the group was jointly at-tending the area around the referent object but also the oc-casions where either the speaker or the none of the listenerslooked at the object table 1

N = 582 BU [-10] DU [0t]None 58 21Both 339 479

Speaker only 111 24Listeners only 74 58

Table 1 Joint attention to the referent object area before(BU) and during (DU) speakerrsquos references

During the preliminary analysis of the corpus we noticedthat in many cases at least one of the listeners was alreadylooking at the area around the referent object before thespeaker would utter a referring expression (figure 5) Itwas our intuition that the salient object was already in thegrouprsquos attention before We looked at -1s before the ut-terance and automatically extracted these cases as can beseen on table 2

N = 582 BU [-1s]None 163Both 168

Speaker only 155Listeners only 96

Table 2 Count of occasions where the referent object hasalready been in focus of attention (-1s before speakerrsquos re-ferring expression)

Finally we looked at the occasions where the interlocu-tors established mutual gaze The table below shows caseswhere the speaker looked at one of the listeners or where atleast one of the listeners looked at the speaker during andbefore referring expressions Mutual gaze indicates wherethe speaker and one of the listeners look at each other andas can be seen this is very rare during or before referringexpressions

N = 582 BU [-10] DU [0t]Mutual gaze 10 20

Speaker at Listeners 52 72Listeners at Speaker 80 143

Table 3 Count of occasions where the speaker looked atthe listeners the listeners looked at the speaker and mutualgaze during and before the speakerrsquos referring expressions

7. Discussion
The presented corpus contains recordings of a variety of multimodal data, is processed and annotated, and provides researchers with the possibility to explore multimodal, multiparty turn-taking behaviours in situated interaction. While its strength lies in the rich annotation and the variation in conversational dynamics, it also has some limitations.

One limitation is the granularity of the eye-gaze annotation. Even though a high-end eye-tracking system was used, it was not always possible to disambiguate which of the potential objects was being gazed at. Given a visual angle threshold, we obtain a confidence measure, defined by the angular difference to each object, for identifying the objects of attention. This limited our analysis to the area around an object rather than a precise measure of the object itself. Another limitation is that we did not analyse the object movements or the participants' pointing gestures; these might explain some of the visual attention cases prior to the verbal referring expressions. Moreover, the eye-tracking glasses and motion capture equipment were quite intrusive: participants complained about fatigue at the end of the interactions, and it was also qualitatively observed that their gaze-head movement coordination changed once they were wearing the glasses.

On most occasions the listeners and the speakers were looking at the area of the referent object before the referring expression was uttered, which could mean that the object was already in the group's attention. In some cases the listeners' visual attention was brought to the object area by the referring language. In line with the cited literature, this suggests that gaze is indicative of the salient objects in a group's discussion and can be used for reference resolution. It is also our intuition that objects that attract a higher fixation density during referring expressions are more salient and can potentially resolve the references. There are a few cases where neither the speaker nor the listeners looked at the referent objects; in such cases, text saliency algorithms (Evangelopoulos et al., 2009) or other multimodal cues such as pointing gestures (Lücking et al., 2015) could be combined to resolve the reference to the salient objects.

Typically, speakers direct the listeners' attention using verbal and non-verbal cues, and listeners often read the speaker's visual attention during referring expressions to get an indication of the referent objects. As in the literature, we found that speakers and listeners frequently gazed at each other during the references to establish grounding on the referent objects. In very few cases, however, did they establish mutual gaze (looking at each other at the same time) during those references.
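The visual-angle limitation and the fixation-density idea discussed above can be summarised in a short sketch: gaze is mapped to candidate objects via the angle between the gaze ray and the eye-to-object direction, and a reference is resolved to the candidate with the largest proportional gaze during the utterance. The 10-degree threshold, the vector representation and all names are illustrative assumptions rather than the parameters actually used for the corpus.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

def visual_angle(gaze_origin: Vec3, gaze_dir: Vec3, obj_pos: Vec3) -> float:
    """Angle in degrees between the gaze ray and the ray from the eye to the object centre."""
    to_obj = tuple(o - g for o, g in zip(obj_pos, gaze_origin))
    dot = sum(a * b for a, b in zip(gaze_dir, to_obj))
    norms = math.hypot(*gaze_dir) * math.hypot(*to_obj)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))

def candidate_objects(gaze_origin: Vec3, gaze_dir: Vec3,
                      objects: Dict[str, Vec3], threshold_deg: float = 10.0) -> List[str]:
    """All objects whose centres fall within the visual-angle threshold of the gaze ray."""
    return [oid for oid, pos in objects.items()
            if visual_angle(gaze_origin, gaze_dir, pos) <= threshold_deg]

def resolve_reference(gaze_time_per_object: Dict[str, float]) -> str:
    """Resolve a reference to the candidate with the highest fixation density."""
    return max(gaze_time_per_object, key=gaze_time_per_object.get)
```

When the candidate set is empty (nobody gazed at the referent), the text-saliency or pointing cues mentioned above would have to take over.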

8. Conclusions
The current paper presents a corpus of multiparty situated interaction. It is fully transcribed and automatically annotated for eye gaze, gestures and spoken language. Moreover, it features an automatic eye-gaze annotation method in which each participant's gaze is resolved in 3D space; a visualisation tool is also provided to qualitatively examine parts of the corpus, or the corpus as a whole, in terms of conversational dynamics, turn-taking and reference resolution. We annotated object references and investigated the proportional gaze from both the perspective of the speaker and that of the listeners. Finally, we quantitatively described the data in the corpus and gave further indications of how the corpus can be useful to the research community. Both the annotated corpus and the visualisations are available at https://www.kth.se/profile/diko/page/material.

9. Acknowledgements
The authors would like to thank Joseph Mendelson for moderating the sessions and helping with the recordings of this corpus. We would also like to acknowledge the support from the EU Horizon 2020 project BabyRobot (687831) and the Swedish Foundation for Strategic Research project FACT (GMT14-0082).

10. Bibliographical References
Admoni, H. and Scassellati, B. (2017). Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction, 6(1):25-63.
Bohus, D. and Horvitz, E. (2009). Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225-234. Association for Computational Linguistics.
Bohus, D. and Horvitz, E. (2010). Facilitating multiparty dialog with gaze, gesture, and speech. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, page 5. ACM.
Bolt, R. A. (1980). 'Put-that-there': Voice and gesture at the graphics interface, volume 14. ACM.
Borji, A. and Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185-207.
Bruce, N. D. and Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5-5.
Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation, 41(2):181-190.
Evangelopoulos, G., Zlatintsi, A., Skoumas, G., Rapantzikos, K., Potamianos, A., Maragos, P., and Avrithis, Y. (2009). Video event detection and summarization using audio, visual and text saliency. In ICASSP 2009: IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3553-3556. IEEE.
Georgeton, L. and Meunier, C. (2015). Spontaneous speech production by dysarthric and healthy speakers: Temporal organisation and speaking rate. In 18th International Congress of Phonetic Sciences, Glasgow, UK, submitted.
Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1):26.
Gross, S., Krenn, B., and Scheutz, M. (2017). The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 189-196. ACM.
Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204.
Hung, H. and Chittaranjan, G. (2010). The IDIAP Wolf corpus: exploring group behaviour in a competitive role-playing game. In Proceedings of the 18th ACM International Conference on Multimedia, pages 879-882. ACM.
Kennington, C., Kousidis, S., and Schlangen, D. (2013). Interpreting situated dialogue utterances: an update model that uses speech, gaze and gesture information. In Proceedings of SIGdial 2013.
Lücking, A., Bergmann, K., Hahn, F., Kopp, S., and Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 Workshop: Multimodal Corpora - Advances in Capturing, Coding and Analyzing Multimodality.
Lücking, A., Pfeiffer, T., and Rieser, H. (2015). Pointing and reference reconsidered. Journal of Pragmatics, 77:56-79.
Meena, R., Skantze, G., and Gustafson, J. (2014). Data-driven models for timing feedback responses in a map task dialogue system. Computer Speech & Language, 28(4):903-922.
Mehlmann, G., Häring, M., Janowski, K., Baur, T., Gebhard, P., and André, E. (2014). Exploring a model of gaze for grounding in multimodal HRI. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 247-254. ACM.
Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S. M., Tyagi, A., Casas, J. R., Turmo, J., Cristoforetti, L., Tobia, F., et al. (2007). The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. Language Resources and Evaluation, 41(3-4):389-407.
Oertel, C., Cummins, F., Edlund, J., Wagner, P., and Campbell, N. (2013). D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 7(1-2):19-28.
Oertel, C., Funes Mora, K. A., Sheikhi, S., Odobez, J.-M., and Gustafson, J. (2014). Who will get the grant? A multimodal corpus for the analysis of conversational behaviours in group interviews. In Proceedings of the 2014 Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, pages 27-32. ACM.
Prasov, Z. and Chai, J. Y. (2008). What's in a gaze? The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 20-29. ACM.
Prasov, Z. and Chai, J. Y. (2010). Fusing eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 471-481. Association for Computational Linguistics.
Rayner, K. (1995). Eye movements and cognitive processes in reading, visual search, and scene perception. In Studies in Visual Information Processing, volume 6, pages 3-22. Elsevier.
Schlangen, D. and Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710-718. Association for Computational Linguistics.
Sheikhi, S. and Odobez, J.-M. (2012). Recognizing the visual focus of attention for human robot interaction. In International Workshop on Human Behavior Understanding, pages 99-112. Springer.
Skantze, G., Johansson, M., and Beskow, J. (2015). Exploring turn-taking cues in multi-party human-robot discussions about objects. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pages 67-74. ACM.
Staudte, M. and Crocker, M. W. (2011). Investigating joint attention mechanisms through spoken human-robot interaction. Cognition, 120(2):268-291.
Stefanov, K. and Beskow, J. (2016). A multi-party multi-modal dataset for focus of visual attention in human-human and human-robot interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 of May. ELRA.
Sziklai, G. (1956). Some studies in the speed of visual perception. IRE Transactions on Information Theory, 2(3):125-128.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1):1-191.
