

Generic Representation Learning for Dynamic Social Interaction

Yanbang Wang, ywangdr@cs.stanford.edu, Stanford University

Pan Li, [email protected], Stanford University, Purdue University

Chongyang Bai, [email protected], Dartmouth College

V.S. Subrahmanian, [email protected], Dartmouth College

Jure Leskovec, [email protected], Stanford University

ABSTRACT

Social interactions, such as eye contact, speaking and listening, are ubiquitous in our life and carry important clues about people's social status and psychological state. With evolving dynamics fundamentally different from social relationships, the complex interactions among a group of people are another informative resource for analyzing patterns of social behaviors and characteristics. Despite their great importance, previous approaches to extracting patterns from such dynamic social interactions remain underdeveloped and overly task-specific. We fill this gap by proposing a temporal network formulation of the problem, together with a representation learning framework, temporal network-diffusion convolution networks (TNDCN). The framework accommodates many downstream tasks with a unified structure: we propagate people's fast-changing descriptive traits over their evolving gazing networks with a specially designed (1) network diffusion scheme and (2) hierarchical pooling, to learn high-quality embeddings for downstream tasks using a consistent structure and minimal feature engineering. Analysis shows that (1) captures not only patterns from existing interactions but also people's avoidance of interactions, which turns out to be just as critical, and that (2) lets us flexibly collect fine-grained critical interaction features scattered over an extremely long time span, which is key to success and where almost all previous temporal GNNs based on recurrent structures empirically fail. We evaluate our model on three different prediction tasks: detecting deception, dominance and nervousness. Our model not only consistently outperforms previous baselines but also provides good interpretability, yielding two important pieces of social insight derived from the learned coefficients.

ACM Reference Format:
Yanbang Wang, Pan Li, Chongyang Bai, V.S. Subrahmanian, and Jure Leskovec. 2020. Generic Representation Learning for Dynamic Social Interaction. In Proceedings of KDD '20: Knowledge Discovery in Databases (KDD MLG'20). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD MLG'20, Aug 24–27, 2020, San Diego, CA
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Social interactions, referring to the numerous and complicated actions among two or more people, have woven themselves into every piece of our daily life [33]. They are one of the most ubiquitous ways that people connect with each other, and can happen whenever several people meet physically or online through video meeting tools. Such interactions, characterized by synchronous, high-frequency eye contact, fast-changing facial expressions, vocal features and even physical proximity, have evolving dynamics fundamentally different from acquaintance-based social networks.

With such informative signals characterizing complex human group behaviors, psychological states and socio-economic status, social interactions become critical data resources for social scientists to study patterns of human behaviors and make further inferences [25]. For example, where, when and how people interact with others may provide informative cues for various social tasks including deception detection [4, 13], dominance identification [5, 8], personality trait characterization [30] and friendship inference [16].

However, existing literature [4, 5, 8, 14, 15, 41] on each of these tasks relies primarily on handcrafted features that are hardly transferable across tasks. This makes the modeling process overly task-specific and demanding of domain expertise. In that regard, we propose a more generic framework that formulates social interactions, with their vigorous dynamics, as a temporal network for more unified representation learning. Central to this formulation is the use of people's evolving eye-gazing or speak-and-listen-to relationships for temporal network construction. With eye contact as the key signal of information flow [36] in a social interaction, both the explicit messages and people's underlying influence on each other can now be naturally modeled as a graph diffusion process, which essentially instantiates a variant of the powerful (temporal) graph neural network [23]. We term this temporal network model the dynamic social interaction network.

Several temporal GNN frameworks have been proposed recently for representing generic temporal networks. However, they are not well-suited to our downstream tasks. One important reason is that they are primarily designed to fit the occurrence/attributes of temporal edges and thus almost always place disproportionate weight on events towards the sequence's end [11, 29, 39, 47]. However, our tasks (deception detection, for example) essentially need to collect the different, sporadic and potentially overlapping traits of communication throughout the interaction. Such a mismatch becomes especially noticeable with a long, fine-grained interaction sequence


(which is a must due to the events' high-frequency nature): a 6-minute conversation extracted at 3 FPS yields more than 1000 snapshots, which fails all the previous temporal models based on recurrent structures [20, 21, 28, 35, 37]. Other approaches either assume a static underlying graph structure, such as those designed for traffic forecasting [26, 45], or claim to handle dynamic node attributes but have never been evaluated in settings with node attributes as highly dynamic as those in social interaction networks [21, 28, 35, 37].

Our contributions in this paper are summarized as follows:

• We propose formulating social interactions as a temporal network prototype to enable unified representation learning for many downstream tasks with a minimal level of knowledge-based feature engineering.

• We propose an end-to-end neural-network-based model, temporal network-diffusion convolution networks (TNDCN), that both intuitively models the information flow with enhanced graph convolution and flexibly collects scattered fine-grained patterns over long time spans with a special hierarchical pooling.

• We evaluate TNDCN on three different social tasks: deception, dominance and nervousness detection. Our model consistently outperforms a variety of baselines. In-depth analysis shows that the learned coefficients further yield interesting insights on interaction patterns.

2 PROBLEM FORMULATION

In this section, we first introduce notation for static graphs in general. Then, we put our problem in context by formulating it as a temporal graph. We use capital letters N, M, M' to denote positive integers.

2.1 General Graph Notations

A static network can be represented as a graph G = (V, E), where V denotes the set of nodes and E the set of edges; N = |V|. The networks we discuss include both directed and undirected ones. As an undirected graph can be viewed as a special case of a directed one, we assume G is directed hereafter unless otherwise specified.

Graph G is associated with an adjacency matrix A ∈ R^{N×N}, where A_{uv} = 1 if and only if (u, v) ∈ E. We also define the diagonal out-degree matrix D_out ∈ R^{N×N}, whose u-th diagonal component is d_{out,u} = Σ_{v∈V} A_{uv}. The random walk matrix over G is defined as

W = D_out^{-1} A.   (1)
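As a concrete illustration (our own sketch, not the authors' code), the random walk matrix of Eqn. (1) is just a row normalization of the adjacency matrix:

```python
import numpy as np

def random_walk_matrix(A):
    """Row-normalize an adjacency matrix: W = D_out^{-1} A (Eqn. 1)."""
    d_out = A.sum(axis=1)          # out-degrees d_{out,u}
    d_out[d_out == 0] = 1.0        # guard against isolated nodes
    return A / d_out[:, None]      # each row sums to 1

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])
W = random_walk_matrix(A)
```

Each row of `W` then distributes unit mass over the node's out-neighbors.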

2.2 Dynamic Social Interaction Networks

Central to the prototype's formulation are two things: nodes, which represent people, and timestamped edges, which represent interactions between two people and are usually mapped from people's evolving eye-gazing or speak-listen relationships. Dynamic personal traits are further associated with nodes, and communication-based properties, such as gazing probabilities, are associated with edges.

To record social interactions, the most common practice is to leverage a variety of sensors to record snapshots of interaction scenes at high time resolution. Therefore, we define our data structures using temporal graph snapshots {G_t}_{1≤t≤T}, where G_t = (V_t, E_t). We further denote the universal node set V = ⋃_{t∈[1,T]} V_t, so that V is fixed across different snapshots.

As mentioned, dynamic social interaction networks can be associated with both dynamic node attributes and dynamic edge attributes. For node attributes, we have {X_t}_{1≤t≤T}, where the row of X_t corresponding to node u holds u's initial attributes. For edges, we also allow each network edge to carry a quantified attribute, denoted by a weighted adjacency matrix A in generalized form. For example, an edge can carry the likelihood of one person looking at the other, or the wireless signal strength that one's smart device receives from the other's. Generalization to multi-type edges yields multi-view temporal networks, which are beyond the scope of this work and left for future work. Note that our definition allows both edge and node attributes to evolve dynamically.

Our goal is to learn representations for nodes in these networks that capture important patterns from their social interaction behaviors. Once the representations are learned, prediction/inference for specific tasks can be accomplished by feeding them into inference blocks for downstream tasks. We claim our approach can be used for general node-level prediction tasks that require patterns to be extracted from dynamic social interaction networks. We demonstrate this capability on three such tasks: detecting deceptive, dominant, and nervous people in a social interaction event. The specific inference blocks and training objectives are specified in Section 5.

3 PROPOSED MODEL

Our model, the temporal network-diffusion convolutional network (TNDCN), consists of two main components: network diffusion of node attributes, and set-temporal convolution-based hierarchical pooling over time, as shown in Figure 1. The obtained node embeddings are further fed into a simple task-driven output block to either compute the loss (during training) or make inferences. The loss is task-specific, so we introduce it in Section 5.

3.1 Network Diffusion Component

We propose using a network diffusion process to model, in the most intuitive way, the information flows carried by interactive behaviors in our dynamic network: given people's personal traits in a certain snapshot, quantified by node attributes X^{(0)} ∈ R^{N×M}, the k-hop network diffusion can be written as

X^{(k)} = (W^T)^k X^{(0)}   (2)

where W^T is the transpose of W. This process is further enhanced by two sets of parameters:

Parameters β for making or avoiding interactions. One distinctive feature of social interaction networks is that the behavior of avoiding interactions can be very informative. For example, deceivers tend to avoid gazing at others [32], and some deceivers may be abnormally quiet in front of others [41] due to low self-confidence, while a different phenomenon can occur between a follower and his leader [42]. So we consider graphs corresponding to the original interaction networks and their complement graphs simultaneously. Concretely, for each type of interaction network with adjacency matrix A,¹ we also consider the adjacency matrix of the complement network A̅ = 1 − A, where 1 here denotes the all-ones matrix.

¹We assume the components of A are normalized within the interval [0, 1].

Figure 1: Temporal Network-Diffusion Convolution Network: an Illustration.

Then, we introduce another parameter β ∈ [0, 1] to merge these two networks into a new adjacency matrix via

A' = βA + (1 − β)A̅ = (2β − 1)A + (1 − β)1.   (3)

This parameter β has a physical interpretation: a greater β suggests that making interactions is more informative for the prediction task, while a smaller β emphasizes that avoiding interactions may be the key clue. From now on, we diffuse node attributes over the random walk matrix derived from this parameterized network A' based on (1), i.e., we replace W in Eqn. (2) by W' = D'^{-1} A'. Admittedly, using the complement graph leads to a densely connected network, which can incur large computation costs on large graphs. However, this issue does not arise here, since dynamic social interactions in our context hardly ever involve more than hundreds of people in a single interaction event.
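The β-merged network and its random-walk matrix can be sketched as follows (our own illustration, consistent with the elementwise form A'_{uv} = βA_{uv} + (1 − β)(1 − A_{uv}) shown in Figure 1; in the actual model β is a learnable scalar):

```python
import numpy as np

def merged_adjacency(A, beta):
    """A' = beta*A + (1-beta)*(1-A): merge a network with its complement (Eqn. 3)."""
    return beta * A + (1.0 - beta) * (1.0 - A)

def random_walk(Ap):
    """W' = D'^{-1} A' as in Eqn. (1)."""
    d = Ap.sum(axis=1)
    d[d == 0] = 1.0
    return Ap / d[:, None]

A = np.array([[0.0, 0.9, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.3, 0.0]])   # gazing probabilities, normalized to [0, 1]
beta = 0.7
W_prime = random_walk(merged_adjacency(A, beta))
```

With β = 1 the merged network reduces to the original interactions; with β = 0 it is purely the complement, i.e., avoided interactions.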

Parameters Γ_k for different-hop interactions. The model now performs different-step network diffusion. Assigning a group of learnable parameters {Γ_k}_{k≥0}, where Γ_k is a diagonal matrix for hop k, we consider the transformation of the initial node attributes X^{(0)} ∈ R^{N×M} based on network diffusion as

H = Σ_{k≥0} H^{(k)} Γ_k = Σ_{k≥0} (W'^T)^k H^{(0)} Γ_k,   H^{(0)} = f(X^{(0)})   (4)

where f(·): R^{N×M} → R^{N×M'} can be as simple as the identity mapping (M' = M) or as complex as a multi-layer perceptron (MLP) that properly transforms and normalizes the initial node attributes. Here, M' is the output channel dimension. Γ_k ∈ R^{M'×M'} provides the weights for the k-hop diffusion; its q-th diagonal component, denoted γ_{k,q}, is the weight for the q-th output channel. In practice, typically only the first several hops are informative, so we may set an upper bound on the number of hops: 5–10 steps give good enough results in practice.
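Eqn. (4) can be sketched as a short loop (a simplification of ours: K-hop truncation, f as the identity, and each diagonal Γ_k stored as one row of a 2-D array):

```python
import numpy as np

def diffuse(W_prime, X0, Gamma):
    """H = sum_k (W'^T)^k X0 Gamma_k (Eqn. 4, truncated; f = identity).

    Gamma has shape (K+1, M): row k holds the diagonal of Gamma_k."""
    N, M = X0.shape
    H = np.zeros((N, M))
    Hk = X0.copy()                 # H^{(0)}
    for k in range(Gamma.shape[0]):
        H += Hk * Gamma[k]         # right-multiply by diagonal Gamma_k
        Hk = W_prime.T @ Hk        # advance one diffusion hop
    return H

rng = np.random.default_rng(0)
W_prime = np.full((4, 4), 0.25)    # toy doubly-stochastic walk matrix
X0 = rng.standard_normal((4, 3))   # N = 4 people, M = 3 traits
Gamma = 0.5 * np.ones((6, 3))      # K = 5 hops, uniform channel weights
H = diffuse(W_prime, X0, Gamma)
```

With a single Γ_0 = I the output reduces to the raw attributes, which makes the hop weights easy to sanity-check.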

Note that formulation (4) has many implications. Consider the sequence {γ_{k,q}}_{k≥0} for any q and suppose f is the identity mapping. From the perspective of graph spectral convolution, {γ_{k,q}}_{k≥0} corresponds to weights on different levels of smoothness of the q-th node attribute. Moreover, different fixed formulations of γ_{k,q} also yield different types of node rankings: γ_{k,q} ∝ α^k corresponds to PageRank [31]; γ_{k,q} ∝ h^k/k! corresponds to heat-kernel PageRank [10]. Extensive feature engineering shows that different formulations of ranks can be important signals for detecting deceivers or leaders among a group of people [4, 5]. Our formulation based on learnable parameters allows more representational power to cover multiple prediction tasks. Moreover, for model interpretation, as W'^T is column-stochastic, it keeps the ℓ1-norm of every column of H^{(k)} unchanged (with non-negative features) and thus naturally has a normalizing property. Therefore, the value |γ_{k,q}| and the sign of γ_{k,q} are naturally interpreted as the effect of the k-hop diffusion of the q-th node attribute on the final representation. Even when we choose f as an MLP, decoupling the diffusion parameters Γ_k from the pure attribute-transformation parameters in f(·) keeps the effect of network diffusion distinguishable, which is useful for model interpretation.
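For concreteness, the two fixed hop-weight profiles mentioned above can be written out as follows (illustrative values only; the decay constants `alpha` and `h` are ours, not the paper's):

```python
import math

K, alpha, h = 10, 0.85, 1.0
# gamma_{k,q} ∝ alpha^k: geometric decay, as in PageRank
pagerank_w = [alpha ** k for k in range(K + 1)]
# gamma_{k,q} ∝ h^k / k!: factorial decay, as in heat-kernel PageRank
heat_w = [h ** k / math.factorial(k) for k in range(K + 1)]
```

The heat-kernel weights decay much faster than the geometric ones, which is why only the first few hops matter in practice.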

Note that there are variants of (4) that further increase model complexity and representational power. By adding a nonlinear transformation to each step H^{(k)} before letting it propagate, one obtains the model of graph convolutional networks [23]. However, adding per-step non-linearity increases the difficulty of training, which limits propagation to 2-3 steps, and can simultaneously hurt model interpretability. As our experiments showed no improvement from non-linearity, the simpler model is preferred. Similar gains from removing non-linearity have also been observed in much recent literature on graph neural networks [24, 43]. However, to the best of our knowledge, we are the first to show the success of this approach for processing dynamic networks.

3.2 Set-Temporal Convolution Component

To aggregate node features over time, we propose a method called Set TCN (S-TCN) to handle complex and long-term temporal social interactions. As mentioned, there are two challenges in designing this block: handling an extremely long sequence of snapshots due to the high time resolution, and capturing local dynamics that are typically subtle and scattered randomly over the whole time span. Correspondingly, our S-TCN block consists of two components: multi-layer temporal convolution and set pooling.

Multi-layer Temporal Convolution. The input to this block is a sequence of node features {H_t}_{1≤t≤T}, where H_t ∈ R^{N×M} denotes the node features for snapshot t obtained via (4). The l-th


Figure 2: Receptive field of temporal convolution: the interaction occurring at the two blue timestamps in layer l is captured by the blue timestamps in layers l+1 and l+2 through the convolution operation.

temporal convolution layer is defined as follows:

Z_t^{(0)} = H_t
Z̃_t^{(l)} = ReLU((Z^{(l-1)} * C^{(l)})_t) = ReLU(Σ_{τ∈[1,w]} Z^{(l-1)}_{t-τ+1} C^{(l)}_τ)
Z_t^{(l)} = max-pooling({Z̃^{(l)}_{2t}, Z̃^{(l)}_{2t+1}}),   for 1 ≤ l ≤ L

where * is the convolution operator, {C^{(l)}_τ}_{1≤τ≤w} is the convolution kernel, and w is the window size.

The number of layers L typically depends on the time-scale of

interactions we want to extract patterns from. It is related to the receptive field of convolutional networks (see Fig. 2). The success of TCN in our setting may be due to its clear and flexible receptive fields. If the max-pooling kernel size is 2, as used in the equations above, then neurons in the last (L-th) convolution layer can perceive signals of length 2^L. The size of the receptive field can be set based on two important uses: (1) Signal denoising: convolution kernels are widely known for their ability to function as low-pass filters; by stacking different numbers of convolution layers, we can explicitly tune the network's capacity for signal smoothing. (2) Temporal feature extraction from a well-defined "locality": by tuning the number of layers, one can actively search for the optimal receptive field length for gathering meaningful features. Such a length is also an important reference for understanding people's interactions.

Given a proper TCN depth L ∈ [2, 4] to obtain a proper receptive field size, the sequence length at the final layer can still be very long (≥ 50), because the original time series is long (T ≳ 1000).
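A minimal NumPy sketch of one temporal convolution + pooling layer and its stacking (our own simplification: a single kernel reused across layers, no padding, random weights):

```python
import numpy as np

def tcn_layer(Z, C):
    """One temporal convolution (window w) + stride-2 max pooling.

    Z: (T, M) features over time; C: (w, M, M) convolution kernel."""
    w, T = C.shape[0], Z.shape[0]
    # ReLU(sum_{tau in [1, w]} Z_{t-tau+1} C_tau), for each valid t
    conv = np.stack([
        np.maximum(sum(Z[t - tau + 1] @ C[tau - 1] for tau in range(1, w + 1)), 0.0)
        for t in range(w - 1, T)
    ])
    # halve the sequence: max over adjacent pairs of timestamps
    Tc = conv.shape[0] // 2
    return np.maximum(conv[0:2 * Tc:2], conv[1:2 * Tc:2])

rng = np.random.default_rng(1)
Z = rng.standard_normal((1000, 8))      # ~1000 snapshots, M = 8 channels
C = 0.1 * rng.standard_normal((3, 8, 8))
out, L = Z, 4
for _ in range(L):                      # L layers: sequence length ~ T / 2^L
    out = tcn_layer(out, C)
```

Each layer halves the sequence, so with L = 4 the 1000-snapshot series is compressed to a few dozen timestamps, each covering a ~2^L-snapshot receptive field.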

Set Pooling. As opposed to online social networks, which often show seasonal patterns, offline social interaction networks seldom exhibit periodic patterns. Consider eye contact in a conversation or meeting among a group of people: informative patterns of people's interactive behaviors are usually randomly scattered over the long time span. Therefore, based on the local patterns captured by the TCN, we use set pooling over the obtained sequence {Z_t^{(L)}}_{1≤t≤T^{(L)}} to extract the messages scattered within this long sequence. We observe that the following form is generally effective across different applications: first, we impose max pooling (5) on {Z_t^{(L)}}_{1≤t≤T^{(L)}} to emphasize the critical local patterns; then, we linearly merge the output of max pooling into each Z_t^{(L)} so that each Z_t^{(L)} captures global information; finally, after a simple ReLU activation, we obtain the output via mean pooling:

Z_max = max-pooling_{1≤t≤T^{(L)}}(Z'_t),   Z'_t = Z_t^{(L)}   (5)
Z_out = mean-pooling_{1≤t≤T^{(L)}}(ReLU(Θ₁Z'_t + Θ₂Z_max))   (6)

Note that the max pooling captures the essence of the randomly scattered patterns, while the second step, based on the linear combination and the mean pooling, is found to improve the robustness of feature aggregation. This set-pooling technique properly tailors Deep Sets [46] to our setting.
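Eqns. (5)-(6) amount to a few lines of NumPy (a sketch of ours; in the model, Θ₁ and Θ₂ are learned):

```python
import numpy as np

def set_pool(Z, Theta1, Theta2):
    """Set pooling (Eqns. 5-6): max pool -> linear merge -> ReLU -> mean pool.

    Z: (T_L, M) final-layer features for one node; Theta1, Theta2: (M, M)."""
    z_max = Z.max(axis=0)                                       # Eqn. (5)
    merged = np.maximum(Z @ Theta1.T + z_max @ Theta2.T, 0.0)   # ReLU(Theta1 z_t + Theta2 z_max)
    return merged.mean(axis=0)                                  # Eqn. (6)

rng = np.random.default_rng(2)
Z = rng.standard_normal((60, 8))
emb = set_pool(Z, np.eye(8), np.zeros((8, 8)))
```

Because the max and mean pools are permutation-invariant over timestamps, the result depends only on the *set* of local patterns, matching the "randomly scattered" assumption above.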

4 RELATED WORK

The research related to our problem spans two broad areas.

Methods to Analyze Social Interactions. Much research has been conducted to identify human behaviors and relationships during social interactions, such as leadership [8], dominance [5, 22], friendship [9], and deception [4, 14]. These works focus on designing task-specific features over short periods and aggregating them into long-term feature vectors via statistical methods. The obtained features are then fed into standard classifiers (e.g., SVM, Random Forest). These engineered features, although shown to be powerful in their corresponding tasks, are less general and often require specific domain knowledge in social science and psychology (e.g., visual dominance ratio [15], emotions and deception [41]). Moreover, the hand-crafted features become noisier when built upon raw features. To effectively aggregate features over the long temporal domain, extensive statistical methods are employed, such as summation, median and variance [8, 14], and histograms and bags-of-words [4, 5]. These non-differentiable aggregations make the resulting models untrainable. In contrast, thanks to our neural-network-based module, we obtain a differentiable model that connects raw networks directly to the social tasks and allows end-to-end training. Through this training procedure, the model naturally learns the effect of the interaction networks on various social tasks without extensive statistical analysis.

Representation Learning for Dynamic Networks. The success of representation learning for dynamic social interaction networks strongly depends on processing the interweaving, highly dynamic node attributes and interactions. We therefore first partition previous approaches to dynamic network representation learning into two categories according to whether dynamic node attributes can be processed. We do not provide a detailed review of works unable to take dynamic node attributes, including [1, 11, 20, 29, 38, 39, 47, 48]. Works claimed to digest dynamic node attributes all operate on network snapshots [21, 28, 35, 37]. They generally follow a pipeline that first propagates the node attributes of each network snapshot and then aggregates them over time. For the first step, [21, 28, 37] use graph convolutional networks [23], while Sankar et al. [35] leverage graph attention networks [40]. For the second step, [21, 28, 37] use RNNs and their variants to aggregate node representations, while Sankar et al. [35] use a self-attention mechanism. However, all these approaches share the issue of limited memory capacity when the number of snapshots exceeds 100. Moreover, despite being proposed to process dynamic node attributes, they have not been evaluated in settings with node attributes as highly dynamic as those in dynamic social interaction networks.


5 EMPIRICAL STUDY

5.1 Experiment Settings

Overview. Our proposed model is evaluated on three different node-level social prediction tasks over four datasets:

(1) Dominance Detection: RESISTANCE-1 [6], ELEA [34];
(2) Deception Detection: RESISTANCE-2 [6];
(3) Nervousness Detection: RESISTANCE-3 [6].

Tasks (1) and (3) aim to identify the most dominant/nervous person in a social interaction event among 3-8 people. Task (2) aims to detect all the hidden liars in a social interaction event among 5-8 people. Therefore, we treat Tasks (1) and (3) as one-vs-rest classification problems, and Task (2) as a binary classification problem.

In its original format, each dataset is a collection of videos. Each video records a conversation lasting from around 5 to 30 minutes, with different conversation contexts, which we introduce shortly. We preprocess each video using vision-based and audio-based techniques from various sources, which for each conversation generates around 800-4500 dynamic network snapshots (at 3 FPS, as described in Section 2.2). In the rest of this section, we explain the dataset backgrounds, ground-truth label collection, and feature preprocessing, as well as other implementation details including baselines and model tuning.

Dataset: RESISTANCE-1, 2, 3. All three datasets are collections of videos and surveys recording people's performance in a role-playing party game called The Resistance: Avalon. Each game has 5 to 8 players secretly split into two rival teams before the game starts: the resistance team ("good" people, Team A, about 70% of players) and the spy team ("bad" people, Team B, the remaining 30%). Team B knows everyone's real identity, but Team A does not. Each team's goal is to beat the other in "missions" conducted through discussion and election, which involve frequent deception (presumably only from Team B) as well as argument, queries and persuasion (from all parties). To persuade, people often tend to be dominant and to avoid appearing nervous. Please see the supplementary material² for more background. The three datasets share about 50% of their videos in common; the rest differ slightly due to practical reasons such as missing or mismatched labels.

Labels for RESISTANCE-1, 3 are generated by referencing surveys taken by all participants after each game. The surveys take the form of questionnaires, asking each participant to rate the dominance and nervousness level of the other people. We treat the median score of each person (as rated by others) as their ground-truth score, with ties broken by further comparing the mean. Labels for RESISTANCE-2, which are Team A & B's identities in each game, are given by the dataset. Considering the game rules, it is presumable that Team B (spies) are essentially lying throughout the game.
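As a concrete illustration, the median-with-mean-tie-break rule above can be sketched as follows (a minimal sketch; `most_rated_person` and the toy scores are our own illustration, not part of the released pipeline):

```python
import numpy as np

def most_rated_person(ratings):
    """Pick the index of the highest-rated person from peer ratings.

    ratings: list of 1-D arrays; ratings[i] holds the scores person i
    received from the other participants. Ranking is by median score,
    with ties broken by the mean, as described above.
    """
    keys = [(np.median(r), np.mean(r)) for r in ratings]
    return max(range(len(ratings)), key=lambda i: keys[i])

# Persons 0 and 1 share the same median (4), so the mean breaks the tie.
scores = [np.array([4, 4, 5]), np.array([4, 4, 3]), np.array([1, 2, 2])]
```

Here `most_rated_person(scores)` returns 0, since persons 0 and 1 tie at median 4 but person 0 has the larger mean.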

Dataset: ELEA. The dataset [34] is a widely used benchmark for modeling and detecting personal traits such as dominance [5]. The dataset can be accessed here³. In each video, 3-4 participants collaboratively performed a "winter survival task" by having peaceful discussions. We perform the dominance detection task on this dataset as other labels are unavailable.

²Supplementary material: https://bit.ly/36ipjoO
³ELEA dataset: https://www.idiap.ch/dataset/elea

Labels are generated in a slightly different way, following the protocols of [2, 5]: we detect the more dominant people instead of the single most dominant one. This also provides a new angle of evaluation. There are two categories of dominance scores: (1) perceived dominance (PDom), scores rated by the game organizers who hosted and monitored the game; (2) ranked dominance (RDom), scores that game participants rated each other. We assign binary dominance labels to people by thresholding their dominance scores at the median value of the people in each video.
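The per-video median thresholding can be sketched as follows (the function name and toy scores are hypothetical):

```python
import numpy as np

def binary_dominance_labels(scores):
    """Assign 1 to people whose dominance score exceeds the per-video
    median, 0 otherwise — a sketch of the thresholding rule above."""
    scores = np.asarray(scores, dtype=float)
    return (scores > np.median(scores)).astype(int)

# Four participants in one video; median of [3.2, 4.5, 2.8, 4.0] is 3.6.
labels = binary_dominance_labels([3.2, 4.5, 2.8, 4.0])  # -> [0, 1, 0, 1]
```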

Feature Extraction. We extract the following social interaction features from the videos at a frequency of 3 FPS using a combination of toolkits:

(1) Emotion: intensity of eight emotions + two facial traits (smile, open eyes) by Amazon Rekognition;
(2) FAU: intensity of 17 facial action units using OpenFace [7];
(3) MFCC: voice features widely used in audio analysis [12];
(4) Speak Prob.: probability that a person is speaking, estimated from lip movement [17];
(5) Gazing Prob.: probability that each person looks at each of the other players, estimated from eye movement [3]. For each person, his Gazing Prob. towards other people sums up to 1.

Our dynamic social interaction network is constructed from the last feature.
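Feature (5) can be viewed as one weighted, directed snapshot of the gazing network per frame. A minimal sketch of assembling such a snapshot, assuming raw per-frame gaze scores as input (`gaze_snapshot` and its input format are our own illustration, not the released preprocessing code):

```python
import numpy as np

def gaze_snapshot(raw_gaze):
    """Build one directed gazing-network snapshot.

    raw_gaze: (N, N) nonnegative scores; raw_gaze[i, j] is how strongly
    person i appears to look at person j in this frame. Rows are
    normalized so each person's gazing probabilities sum to 1, matching
    feature (5) above. The diagonal (self-loop) is kept: per the
    interpretation later in the paper, it stands for the chance that
    the person looks at his/her own screen.
    """
    A = np.asarray(raw_gaze, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # isolated person: leave the row as zeros
    return A / row_sums
```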

Output Layer & Loss. As mentioned, we consider Task (2) a binary classification problem, so the output block for processing each person's representation is a simple logistic regression, instantiated by a densely connected neural network layer plus the Sigmoid nonlinearity.

For Tasks (1) and (3), to select one out of a set of 3-8 representations, we further use the following transformation. Let Z_out ∈ R^(N×e), taken from Eqn. 6, be the learned representations of all the people in one interaction event, where N ∈ [3, 8] is the number of people and e is the representation length. The i-th person's output logit is given by:

Z_i = Z_i^out W_1 + Mean-pooling({Z_j^out}_{1≤j≤N}) W_2, for 1 ≤ i ≤ N   (7)

Y = softmax(Z W_3)   (8)

where W_1, W_2 ∈ R^(e×e) and W_3 ∈ R^(e×1) are projection matrices whose parameters are to be learned. We use the cross-entropy loss for all tasks and back-propagate the errors to all the aforementioned learnable parameters for optimization.
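Eqns (7)-(8) can be sketched in NumPy as follows (the actual model is trained in PyTorch; the random matrices here merely stand in for the learned W_1, W_2, W_3):

```python
import numpy as np

def output_logits(Z_out, W1, W2, W3):
    """Eqns (7)-(8): mix each person's embedding with the group mean,
    then softmax over the N people to pick one (one-vs-rest tasks).

    Z_out: (N, e) learned representations; W1, W2: (e, e); W3: (e, 1).
    """
    group = Z_out.mean(axis=0, keepdims=True)   # mean pooling over people
    Z = Z_out @ W1 + group @ W2                 # Eqn (7), broadcast over rows
    logits = (Z @ W3).ravel()                   # one logit per person
    exp = np.exp(logits - logits.max())         # Eqn (8), stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
N, e = 5, 8
Y = output_logits(rng.normal(size=(N, e)),
                  rng.normal(size=(e, e)),
                  rng.normal(size=(e, e)),
                  rng.normal(size=(e, 1)))
```

The mean-pooling term lets each person's logit depend on the whole group, so the softmax compares people relative to their interaction context rather than in isolation.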

Baselines. Our framework is compared with two sets of baselines. The first set consists of task-specific baselines proposed uniquely for each task:

• Dominance Detection: MKL [8], which is based on hand-crafted features like voice pitch and speaking rate; two versions of GDP [5], which primarily relies on a hand-crafted feature called DomRank: GDP with a random forest classifier (GDP-RF), and GDP with a multi-layer perceptron classifier (GDP-MLP); DELF [5] also uses DomRank.

• Deception Detection: ADD [44], based on handcrafted micro facial expression and NLP features; TGCN-L [27], based on gazing probabilities; and LiarRank [4], based on all the features we used but aggregating them in a special manner;


KDD MLG’20, Aug 24–27, 2020, San Diego, CA Yanbang Wang, Pan Li, Chongyang Bai, V.S. Subrahmanian, and Jure Leskovec

Task No. and Task             Dataset        Dynamic Network   Time Steps   Nodes    Interactions†
                                             Sequences         (Avg.)
(1) Dominance Identification  RESISTANCE-1   956               2,514        4,780    4.007 × 10^6
(1) Dominance Identification  ELEA           21                1,350        84       6.474 × 10^3
(2) Deception Detection       RESISTANCE-2   2,157             1,800        10,785   2.439 × 10^7
(3) Nervousness Detection     RESISTANCE-3   1,097             1,800        5,485    4.910 × 10^7

†We count all the interactions with gazing probability ≥ 0.5.
Table 1: General statistics of the dynamic networks used for representation learning.

• Nervousness Detection: Facial Cues [19], based on facial action units; Random Forest and Logistic Regression⁴;

In addition, to ensure fair comparison, we further pick for each task the best reported baseline and evaluate it on the other tasks. Furthermore, we add one more generic temporal GNN-based model that was SOTA on sequence modeling: GCN-LSTM [37], which combines a graph convolutional network with an LSTM. For more details on the features and aggregation schemes used by each baseline, please see our supplementary material.

Model training. Following the protocols of most baselines, we randomly partition the dynamic social interaction networks in each dataset into 10 folds. Each time, a different 10% is reserved for testing, and the rest for training. We use the cross-entropy loss and the Adam optimizer to train our TNDCN model. Hyperparameters are determined using grid search and their detailed values are provided in the supplementary material. The whole pipeline is implemented in PyTorch. Following the protocols of most previous work, we report the average accuracy on the test set by running the pipeline 10 times using different random seeds to initialize parameters.
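The 10-fold partition above can be sketched as follows (the function name is our own; the real pipeline additionally averages accuracy over 10 random seeds):

```python
import numpy as np

def ten_fold_splits(n_events, seed=0):
    """Randomly partition event indices into 10 folds; each fold serves
    once as the roughly-10% test set, the rest as training data."""
    idx = np.random.default_rng(seed).permutation(n_events)
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test
```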

5.2 Performance Comparison

Table 2 compares the performance of our model and the baselines on the three tasks. It shows that our model significantly outperforms the strongest baselines on Tasks 2 and 3, and also achieves a comparable result to the strongest baseline on Task 1. Notice that our best model achieves this with less than one-fifth of the parameters of DELF and GDP. We also notice that our model has a more significant gain on the challenging Tasks 2 and 3. The two tasks are challenging essentially because they are both probing for something that the interaction participants are purposely concealing. For such cases we argue that effectively capturing the temporal cues is the key to success. While almost all the baselines come with proper graph convolution or careful feature engineering, their processing of temporal information falls short, relying on either mean pooling (TGCN, Logistic., Random F.), Fisher Vector encoding (FacialCues), histogram encoding (DELF), or a many-to-one LSTM (GCN-LSTM). Also notice GCN-LSTM's failure on most tasks, which shows the insufficiency of temporal sequence modeling techniques based on recurrent structures.

⁴Since very little previous literature exists on this task, we further implement two baselines that use a simple classifier to process all our input features independently, i.e., without any network-level operation.

5.3 Model Interpretation

The linearity of parameters in the network diffusion enables model interpretation. Next, we investigate these parameters and explain the induced social insights.

Interpretation I: Balancing Weight β. Recall from Section 3.1 that β is the learnable parameter that directly controls the relative importance of proactive interaction versus avoidance of interaction. Figure 3 displays how β converges during training (initialized to 0.5, i.e., neutral). For each task we ran multiple times by introducing small perturbations to β's initialization.

Figure 3: Different convergence behaviors of β during training, 95% confidence intervals shaded; trained on RESISTANCE-1, 2, 3 respectively (axis annotations: Seeking Interaction / Avoiding Interaction).

The figure shows that the parameter exhibits very different convergence behavior across tasks. In particular, on the deception detection task β significantly drops to around 0.2, which indicates that avoidance of interaction may be much more important than contact for detecting deception. Interestingly, this phenomenon coincides with findings from a psychological study [32] on the eye movement of people in various contexts. It is also quite understandable that dominant people are more easily identified by their aggressive way of reaching out to people (rather than escaping from doing so). Nervous people, on the other hand, seem to be identified by a mixture of the two extremes. The analysis of β shows its predictive power and high value for understanding humans' avoidance of social interactions.
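Since Section 3.1 is not reproduced in this section, the following is only a hypothetical sketch of how a balancing weight like β could blend "seeking" and "avoiding" interaction: the gazing snapshot A is mixed with a normalized complement that up-weights the people one did NOT look at. The operator shown is our assumption for illustration, not the paper's exact formulation:

```python
import numpy as np

def mixed_operator(A, beta):
    """Hypothetical sketch of the balancing weight beta.

    A: (N, N) row-stochastic gazing snapshot. The complement (1 - A),
    with the diagonal removed and rows renormalized, captures whom each
    person avoided looking at; beta near 0 (as observed for deception)
    shifts weight onto this avoidance term.
    """
    N = A.shape[0]
    comp = (1.0 - A) * (1 - np.eye(N))                 # avoided targets
    comp = comp / np.maximum(comp.sum(axis=1, keepdims=True), 1e-12)
    return beta * A + (1.0 - beta) * comp
```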

Group           Dominance-R*            Dominance-E*                  Deception            Nervousness

Task-specific   MKL        0.879        Aran et al.  0.657/0.598      FAU        0.608     R. Forest  0.678
Baselines       GDP-MLP    0.917        Okada et al. 0.677/0.686      TGCN-L     0.550     Logistic   0.493
                GDP-RF     0.878        Humans       0.686/-          ADD        0.632     -          -

"Universal"     DELF       0.889        DELF         0.765/0.677      DELF       0.641     DELF       0.721
Baselines†      LiarRank   0.827        LiarRank     0.674/0.612      LiarRank   0.665     LiarRank   0.731
                FacialCues 0.746        FacialCues   0.702/0.665      FacialCues 0.591     FacialCues 0.733
                GCN-LSTM   0.821        GCN-LSTM     0.732/0.585      GCN-LSTM   0.562     GCN-LSTM   0.629

Ours            TNDCN      0.923±0.009  TNDCN        0.774±0.038/0.726±0.022  TNDCN 0.689±0.021  TNDCN 0.769±0.023

* Dominance-R and Dominance-E were done on RESISTANCE-1 and ELEA respectively. For Dominance-E, results on both PDom and RDom labels are reported.
Table 2: Performance comparison on three tasks: detecting dominance, deception and nervousness.

Interpretation II: Diffusion Weights {Γ_k}. Recall from Equation 4 that {Γ_k}_{0≤k≤K} is a sequence of diagonal matrices, where Γ_k ∈ R^(M′×M′) contains the weights corresponding to the M′ features' k-hop diffusion. Therefore, when we finish training we obtain K + 1 diffusion weights for each of the M′ features. Analyzing these diffusion weights provides important insights into how the interaction network actually shapes the original features during diffusion. Fig. 4 displays such weights for four of the features (here the number of diffusion steps is K = 10) after training on the nervousness detection task. The diffusion weights have been normalized so that the 0-hop weight is 1.

Figure 4: Diffusion step weights of different features for the task of identifying the most nervous person in an interaction setting, 95% confidence interval shaded.

First, we observe that the 0-hop weight is significantly the largest, meaning that the original node features are very important to prediction. Therefore, the role of graph diffusion in this task can be roughly regarded as a fine-tuning process over the original features. Second, a clear contrast between the bottom two features is observed. While both of them can propagate quite far via the interactions, the ways in which diffusion modifies the original features are quite different. We attribute this distinction to the two features' different real-world implications: the node in-degree feature in a gazing network snapshot can be interpreted as the attention one receives from other people at the corresponding moment, and thus indicates the interaction engagement level of that person. In contrast, the node self-loop feature can be interpreted as the probability that the person looked at his/her own screen at the corresponding moment, making the person look introverted and reserved. The different practical meanings entailed by the features determine that they are propagated via the interaction network in distinctive manners. Finally, the "smile" and "happy" emotions seem able to diffuse two steps but not beyond.
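The role of the per-feature diffusion weights can be sketched as follows (Eqn 4 itself is not reproduced in this section, so the diffusion operator T and the exact update are our assumptions; gammas stands in for the diagonals of the learned {Γ_k}):

```python
import numpy as np

def weighted_diffusion(T, X, gammas):
    """Sketch of per-feature diffusion weighting in the spirit of the
    {Gamma_k} discussion above: each feature column of X is propagated
    k hops through an operator T and reweighted by its own learned
    scalar gammas[k][f], then summed over k = 0..K.

    T: (N, N) diffusion operator; X: (N, M') node features;
    gammas: (K+1, M') array holding the diagonal entries of each Gamma_k.
    """
    Z = np.zeros_like(X, dtype=float)
    H = X.astype(float)
    for g in gammas:          # g plays the role of diag(Gamma_k)
        Z += H * g            # per-feature reweighting of the k-hop signal
        H = T @ H             # one more diffusion hop
    return Z
```

With only the 0-hop weights nonzero this reduces to the identity on X, matching the observation above that diffusion acts as fine-tuning around the dominant 0-hop term.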

5.4 Ablation Study

An ablation study on Task 1 is further conducted on RESISTANCE-1 to evaluate the usefulness of the TNDCN building blocks independently.

No.  Replacement              Dominance
1    Original                 0.923
2    S-TCN → LSTM             0.758
3    S-TCN → Mean Pooling     0.842
4    Diffusion → None         0.829
5    Diffusion → GCN-1*       0.844
6    Diffusion → GCN-2        0.889
7    Diffusion → GCN-4        0.784
8    Diffusion → GCN-6        0.701
9    Set β = 1                0.889

* GCN with 1 layer, similar hereafter.
Table 3: Ablation and comparison study on the dominance detection task (metric: Mean Accuracy).

As shown by Table 3, Evals. 2-3 further verify the RNN's insufficiency in handling both extremely long time sequences and weak local dynamics. Interestingly, very simple mean pooling can significantly outperform the RNN and achieve a close result even when compared with our set pooling technique. Evals. 4-8 focus on the graph-level techniques. Eval. 4 shows the importance of using the network for prediction. Evals. 5-8 indicate the usability of GCN despite its serious decay, caused by over-smoothing and too many nonlinearities when going deep. In contrast, our network diffusion can propagate as far as 10 hops without significant decay in performance.

5.5 Further Scope Study

While we claim that TNDCN is especially helpful for dealing with interaction sequences that are extremely long and high-frequency, one interesting question is how it performs when the sequence is relatively short and the dynamics are less frequent. We investigate this question by further running TNDCN on CIAW [18], a dataset recording more than 92 people's timestamped proximities (of up to 1.2 meters) in a workplace over 20 days. The goal is to infer each person's department based solely on their interaction data. Notice that there is only one dynamic network, which we partition


into only 20 snapshots. Since no previous work has been done on this dataset, we focus on comparing with generic temporal GNN models. Please see the supplementary material for more details of the settings and results. Our conclusion is that our model is still able to perform quite well even in this special scenario. We attribute this to the high flexibility of our S-TCN module in dealing with sequences of various lengths.

6 CONCLUSION

In this work, we introduced a new neural-network-based representation learning model, TNDCN, which is particularly designed for dynamic social interaction networks. Dynamic social interaction networks contain highly dynamic node attributes along with interactions that have duration, which makes previous dynamic network embedding approaches inapplicable. Our TNDCN model contains a network diffusion block that is capable of extracting patterns from the complex interweaving of highly dynamic node attributes and interactions. It also leverages a combination of TCN and set pooling that can learn the subtle and local patterns of social interactions randomly scattered over a long time span. TNDCN has been evaluated on three node-level social prediction tasks and outperformed all previous baselines. The learned coefficients of TNDCN also give interesting social insights. Overall, TNDCN provides social scientists a powerful tool to automatically analyze social interactions, solve social tasks and extract knowledge for social science.

REFERENCES
[2] Oya Aran and Daniel Gatica-Perez. 2013. One of a kind: Inferring personality impressions in meetings. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction. 11–18.
[3] Sileye O Ba and Jean-Marc Odobez. 2010. Multiperson visual focus of attention from head pose and meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2010), 101–116.
[4] Chongyang Bai, Maksim Bolonkin, Judee Burgoon, Chao Chen, Norah Dunbar, Bharat Singh, VS Subrahmanian, and Zhe Wu. 2019. Automatic Long-Term Deception Detection in Group Interaction Videos. In 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1600–1605.
[5] Chongyang Bai, Maksim Bolonkin, Srijan Kumar, Jure Leskovec, Judee Burgoon, Norah Dunbar, and VS Subrahmanian. 2019. Predicting dominance in multi-person videos. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 4643–4650.
[6] Chongyang Bai, Srijan Kumar, Jure Leskovec, Miriam Metzger, Jay F. Nunamaker, and V. S. Subrahmanian. 2019. Predicting the Visual Focus of Attention in Multi-Person Discussion Videos. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 4504–4510. https://doi.org/10.24963/ijcai.2019/626
[7] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–10.
[8] Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino. 2017. Prediction of the leadership style of an emergent leader using audio and visual nonverbal features. IEEE Transactions on Multimedia 20, 2 (2017), 441–456.
[9] Eunjoon Cho, Seth A Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1082–1090.
[10] Fan Chung. 2007. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences 104, 50 (2007), 19735–19740.
[11] Hanjun Dai, Yichen Wang, Rakshit Trivedi, and Le Song. 2016. Deep coevolutionary network: Embedding user and item features for recommendation. arXiv preprint arXiv:1609.03675 (2016).
[12] Steven Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28, 4 (1980), 357–366.
[13] Sergey Demyanov, James Bailey, Kotagiri Ramamohanarao, and Christopher Leckie. 2015. Detection of deception in the mafia party game. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. 335–342.
[14] Sergey Demyanov, James Bailey, Kotagiri Ramamohanarao, and Christopher Leckie. 2015. Detection of Deception in the Mafia Party Game. In ACM ICMI.
[15] John F Dovidio and Steve L Ellyson. 1982. Decoding visual dominance: Attributions of power based on relative percentages of looking while speaking and looking while listening. Social Psychology Quarterly (1982), 106–113.
[16] Nathan Eagle, Alex Sandy Pentland, and David Lazer. 2009. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences 106, 36 (2009), 15274–15278.
[17] Sergio Escalera, Xavier Baró, Jordi Vitria, Petia Radeva, and Bogdan Raducanu. 2012. Social network extraction and analysis based on multimodal dyadic interaction. Sensors 12, 2 (2012), 1702–1719.
[18] Mathieu Génois, Christian L Vestergaard, Julie Fournet, André Panisson, Isabelle Bonmarin, and Alain Barrat. 2015. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science 3, 3 (2015), 326–347.
[19] G Giannakakis, Matthew Pediaditis, Dimitris Manousos, Eleni Kazantzaki, Franco Chiarugi, Panagiotis G Simos, Kostas Marias, and Manolis Tsiknakis. 2017. Stress and anxiety detection using facial cues from videos. Biomedical Signal Processing and Control 31 (2017), 89–101.
[20] Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. 2020. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems 187 (2020), 104816.
[21] Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. 2019. Variational graph recurrent neural networks. In Advances in Neural Information Processing Systems. 10700–10710.
[22] Dinesh Babu Jayagopi, Hayley Hung, Chuohao Yeo, and Daniel Gatica-Perez. 2009. Modeling dominance in group conversations using nonverbal activity cues. IEEE Transactions on Audio, Speech, and Language Processing 17, 3 (2009), 501–513.
[23] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[24] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Predict then propagate: Graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997 (2018).
[25] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, et al. 2009. Computational social science. Science 323, 5915 (2009), 721–723.
[26] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
[27] Yozen Liu, Xiaolin Shi, Lucas Pierce, and Xiang Ren. 2019. Characterizing and Forecasting User Engagement with In-app Action Graph: A Case Study of Snapchat. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2023–2031.
[28] Franco Manessi, Alessandro Rozza, and Mario Manzo. 2020. Dynamic graph convolutional networks. Pattern Recognition 97 (2020), 107000.
[29] Giang Hoang Nguyen, John Boaz Lee, Ryan A Rossi, Nesreen K Ahmed, Eunyee Koh, and Sungchul Kim. 2018. Continuous-time dynamic network embeddings. In Companion Proceedings of The Web Conference 2018. 969–976.
[30] Shogo Okada, Laurent Son Nguyen, Oya Aran, and Daniel Gatica-Perez. 2019. Modeling dyadic and group impressions with intermodal and interperson features. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 1s (2019), 1–30.
[31] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
[32] Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124, 3 (1998), 372.
[33] Rudolph J Rummel. 1976. Understanding Conflict and War: Vol. 2: The Conflict Helix. Beverly Hills: Sage (1976).
[34] Dairazalia Sanchez-Cortes, Oya Aran, Marianne Schmid Mast, and Daniel Gatica-Perez. 2011. A nonverbal behavior approach to identify emergent leaders in small groups. IEEE Transactions on Multimedia 14, 3 (2011), 816–832.
[35] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. 2020. DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks. In Proceedings of the 13th International Conference on Web Search and Data Mining. 519–527.
[36] Atsushi Senju and Mark H Johnson. 2009. The eye contact effect: mechanisms and development. Trends in Cognitive Sciences 13, 3 (2009), 127–134.
[37] Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. 2018. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing. Springer, 362–373.
[38] Aynaz Taheri, Kevin Gimpel, and Tanya Berger-Wolf. 2019. Learning to Represent the Evolution of Dynamic Graphs with Recurrent Models. In Companion Proceedings of The 2019 World Wide Web Conference. 301–307.


[39] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. 2019. DyRep: Learning Representations over Dynamic Graphs. In International Conference on Learning Representations.
[40] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[41] Aldert Vrij. 2008. Detecting Lies and Deceit: Pitfalls and Opportunities. John Wiley & Sons.
[42] Fred O Walumbwa and John Schaubroeck. 2009. Leader personality traits and employee voice behavior: mediating roles of ethical leadership and work group psychological safety. Journal of Applied Psychology 94, 5 (2009), 1275.
[43] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. 2019. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153 (2019).
[44] Zhe Wu, Bharat Singh, Larry S Davis, and VS Subrahmanian. 2018. Deception detection in videos. In Thirty-Second AAAI Conference on Artificial Intelligence.
[45] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017).
[46] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems. 3391–3401.
[47] Lekui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. 2018. Dynamic network embedding by modeling triadic closure process. In Thirty-Second AAAI Conference on Artificial Intelligence.
[48] Yuan Zuo, Guannan Liu, Hao Lin, Jia Guo, Xiaoqian Hu, and Junjie Wu. 2018. Embedding temporal network via neighborhood formation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2857–2866.