predicting video-conferencing conversation outcomes based ... ?· predicting video-conferencing...

Download Predicting Video-Conferencing Conversation Outcomes Based ... ?· Predicting Video-Conferencing Conversation…

Post on 18-Aug-2018




0 download

Embed Size (px)


  • Predicting Video-Conferencing Conversation Outcomes Based onModeling Facial Expression Synchronization

    Rui Li1, Jared Curhan2 and Mohammed (Ehsan) Hoque11 ROC HCI, Department of Computer Science, University of Rochester, New York, USA

    2 Sloan College of Management, Massachusetts Institute of Technology, Massachusetts, USA

    Abstract Effective video-conferencing conversations areheavily influenced by each speakers facial expression. In thisstudy, we propose a novel probabilistic model to representinteractional synchrony of conversation partners facial ex-pressions in video-conferencing communication. In particular,we use a hidden Markov model (HMM) to capture temporalproperties of each speakers facial expression sequence. Basedon the assumption of mutual influence between conversationpartners, we couple their HMMs as two interacting processes.Furthermore, we summarize the multiple coupled HMMs witha stochastic process prior to discover a set of facial synchro-nization templates shared among the multiple conversationpairs. We validate the model, by utilizing the exhibition ofthese facial synchronization templates to predict the outcomesof video-conferencing conversations. The dataset includes 75video-conferencing conversations from 150 Amazon MechanicalTurkers in the context of a new recruit negotiation. The resultsshow that our proposed model achieves higher accuracy inpredicting negotiation winners than support vector machineand canonical HMMs. Further analysis indicates that somesynchronized nonverbal templates contribute more in predictingthe negotiation outcomes.


    Video-conferencing (VC) becomes a popular platform forpeople to interact in professional and personal capacities [5].On the other hand, some technical issues still exist, forexample the limited view of the person, disengaged eyecontact, and occasional interruptions resulting from networklatency. These issues disrupt social presence, and thus leadto poor VC communication [3]. Nonverbal behavior playsa significant role to enhance social presence. It provides asource of rich information about the speakers intentions,goals, and values [1][3][4]. This motivates us to investigatefacial expressions in VC communication in order to gaininsight into effective communicative skills that will improveproductivity and conversational satisfaction.

    In this study, we investigate interactional synchrony offacial expressions in VC-mediated conversations, as shownin Figure 1. Interactional synchrony refers to patternedand aligned interactions occurring over time [11]. In asynchronic interaction, nonverbal behaviors (e.g., facial ex-pressions, posture, gesture) of the individuals are coordi-nated to the rhythms and forms of verbal expressions. Asa key indicator of interactional involvement, rapport, andmutuality, it has been used in deception detection, onlinelearning, interpersonal trust evaluation, and a variety ofother fields [1][7][8][10]. However, the quantification of

    This work was partially supported by DARPA

    Candidate Recruiter




    Fig. 1: An illustration of one VC conversation pair. The twoparticipants communicate via our web-based VC platformin their own natural environments. The lower panels showthe first six principal components of their facial expressionaction units (AUs) evolving over time.

    interactional synchrony is challenging, and it depends on thespecific social context. We address this challenge by mod-eling facial expression synchronization of VC conversationpartners given the social context of negotation.

    We propose a novel probabilistic model to learn an effec-tive representation of facial interactional synchrony. This rep-resentation contains a set of facial synchronization templatesdisplayed by multiple conversation pairs, as shown in Fig 2.In particular, we utilize a hidden Markov model (HMM) todescribe the temporal properties of each speakers facial ex-pression. The Markovian property assumes that if a speakersmiles at previous time step, it is likely that he/she maintainsthe smile at the current time step, for instance. We furtherassume that there exists mutual influence between a pair ofconversation partners. Namely, if a speakers conversationpartner displays a smile, it is likely that the speaker respondswith a smile. To capture the mutual influence between a pairof conversation partners, we couple their two HMMs togetheras interacting processes. We thus model the multiple conver-sation pairs with the corresponding multiple coupled HMMs.Furthermore, we summarize the multiple coupled HMMs byintroducing a stochastic process as a prior. This prior allowsus to uncover the shared facial synchronization templatesamong the multiple conversation pairs. In this representation,a couple of conversation partners facial expressions canbe decomposed into instantiations of a particular subset ofthe globally shared synchronization templates. This novelrepresentation of facial expression synchronization enables

  • Cand





    Expression Recognition

    Facial Expression Sequences AUs


    Time Series of the First 6 PCs

    Our Model

    Synchronized Facial Expression Templates

    Fig. 2: Diagram of our approach illustrated on one conversation pair. From left to right, 28 facial expression action units(AUs) are extracted from the conversation partners videos using CERT toolbox [9]; the time series of the first 6 principalcomponents are transformed from both the AUs with the principal component analysis (PCA), and these six-dimensionalPC time series are the input to our model; the model automatically decomposes the facial expression time series into thesalient segments (color coded) which correspond to a subset of globally shared facial synchronization templates displayedby this pair.

    us to not only interpret effective VC communication skillsbut also predict the outcomes of the conversations.

    To conduct this study, we develop a VC system thatworks via a web browser without any additional download orplugin support. The platform is designed to enable auto audioand video upload in a remote server every 30 seconds astwo people engage in a video-conference. This functionalityallows the framework to be deployed in Amazon MechanicalTurk, where remote human workers with access to a webbrowser and webcam communicate with each other. To thebest of our knowledge, this is the first study to investigateVC-mediated facial expression synchrony. The contributionsof our study include:

    We build a novel probabilistic model to learn an ef-fective representation of facial interactional synchronyin VC communication. This novel representation de-composes multiple pairs of conversation partners facialexpression sequences into a set of globally sharedsynchronization templates.

    We further represent a conversation by the frequenciesof occurrence of its facial synchronization templates,and achieve superior accuracy (78% on average) thansupport vector machine (SVM) and canonical HMMsto predict conversation outcomes.


    We validate our model using the dataset collected froma study engaging Mechanical Turkers in a recruitment ne-gotiation [2]. In this study, the conversational speech andfacial expressions are recorded. The outcomes are the numberof points earned by each participant and a post-negotiationquestionnaire regarding the participants evaluation of theircounterparts and the negotiation process.

    A. Participants

    242 Mechanical Turkers participate in the study. Partici-pants are informed that their negotiations would be recordedand that the studys purpose is to investigate negotiation

    skills. The data collected from 150 of the Turkers is availablefor further analysis. Among them, 43 participants (%29) arefemale. The remaining Turkers either had damaged videosor lacked post-questionnaire data.

    B. Apparatus

    The negotiators interact with each other through acomputer-mediated platform based on a browser-based VCsystem. The existing freely available video software (e.g.,Skype, Google+ Hangouts) often requires users to downloadtheir application or install a plugin. In addition, Skypescurrent API and Google+ do not allow us to capture andrecord audio or video streams. To handle these hurdles, wedevelop our own browser-based VC system that is capableof capturing and analyzing the video stream in the cloud. Weimplement the functionality to transfer audio and video dataevery 30 seconds to prevent data loss and dynamically adaptto variant network latency.

    C. Task

    The task of this experiment is defined as a recruitmentcase. A recruitment case involves a scenario in which acandidate, who already has an offer, negotiate the compen-sation package with the recruiter. The candidates and therecruiters need to reach an agreement on eight issues relatedto salary, job assignment, location, vacation time, bonus,moving expense reimbursement, starting date, and healthinsurance. Each negotiation issue offers 5 possible optionsfor resolution. Each option is associated with a specificnumber of points for each party. The goal of the negotiatorsis to maximize the total points they can possibly earn (e.g.,the 5 optional offers on salary issue range from 65K to 45K.Candidate receives maximum points if he/she could settlewith salary of 65k whereas recruiter loses maximum points,and vice versa).

    D. Procedure

    As Amazon mechanical Turkers take the HIT, they areformed into 75 conversation pairs sequentially. The social

  • )(i