
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2506–2515, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics


Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model

Yitao Cai, Huiyu Cai and Xiaojun Wan
Institute of Computer Science and Technology, Peking University

The MOE Key Laboratory of Computational Linguistics, Peking University
Center for Data Science, Peking University

{caiyitao,hy cai,wanxiaojun}@pku.edu.cn

Abstract

Sarcasm is a subtle form of language in which people express the opposite of what is implied. Previous work on sarcasm detection focused on texts. However, more and more social media platforms like Twitter allow users to create multi-modal messages, including texts, images, and videos. It is insufficient to detect sarcasm in multi-modal messages based only on texts. In this paper, we focus on multi-modal sarcasm detection for tweets consisting of texts and images in Twitter. We treat text features, image features and image attributes as three modalities and propose a multi-modal hierarchical fusion model to address this task. Our model first extracts image features and attribute features, and then leverages the attribute features and a bidirectional LSTM network to extract text features. The features of the three modalities are then reconstructed and fused into one feature vector for prediction. We create a multi-modal sarcasm detection dataset based on Twitter. Evaluation results on the dataset demonstrate the efficacy of our proposed model and the usefulness of the three modalities.

1 Introduction

Merriam-Webster defines sarcasm as "a mode of satirical wit depending for its effect on bitter, caustic, and often ironic language that is usually directed against an individual". It has the magical power to disguise the hostility of the speaker (Dews and Winner, 1995) while enhancing the effect of mockery or humor on the listener. Sarcasm is prevalent on today's social media platforms, and its automatic detection bears great significance in customer service, opinion mining, online harassment detection and all sorts of tasks that require knowledge of people's real sentiment.

Twitter has become a focus of sarcasm detection research due to its ample resources of publicly available sarcastic posts. Previous works on Twitter sarcasm detection focus on the text modality and propose many supervised approaches, including conventional machine learning methods with lexical features (Bouazizi and Ohtsuki, 2015; Ptáček et al., 2014), and deep learning methods (Wu et al., 2018; Baziotis et al., 2018).

However, a detector using only the text modality can never be certain of the true intention of the simple tweet "What a wonderful weather!" until the dark clouds in the attached picture (Figure 1(a)) are seen. Images, while ubiquitous on social platforms, can help reveal (Figure 1(a)), affirm (Figure 1(b)) or disprove the sarcastic nature of tweets, and are thus intuitively crucial to Twitter sarcasm detection.

In this work, we propose a multi-modal hierarchical fusion model for detecting sarcasm in Twitter. We leverage three types of features, namely text, image and image attribute features, and fuse them in a novel way. During early fusion, the attribute features are used to initialize a bidirectional LSTM network (Bi-LSTM), which is then used to extract the text features. The three features then undergo representation fusion, where they are transformed into reconstructed representation vectors. A modality fusion layer performs a weighted average of the vectors and feeds the result to a classification layer to yield the final prediction. Our results show that all three types of features contribute to the model performance. Furthermore, our fusion strategy successfully refines the representation of each modality and is significantly more effective than simply concatenating the three types of features.

Our main contributions are summarized as follows:

• We propose a novel hierarchical fusion model to address the challenging multi-modal sarcasm detection task in Twitter. To the best of our knowledge, we are the first to deeply fuse the three modalities of image, attribute and text, rather than relying on naïve concatenation, for Twitter sarcasm detection.

• We create a new dataset for multi-modal Twitter sarcasm detection and release it.¹

• We quantitatively show the significance of each modality in Twitter sarcasm detection. We further show that to fully unleash the potential of images, we need to consider image attributes, a form of high-level abstract information bridging the gap between texts and images.

¹https://github.com/headacheboy/data-of-multimodal-sarcasm-detection

Figure 1: Examples of the image modality aiding sarcasm detection. (a) "What a wonderful weather!": the image is necessary for the sarcasm to be spotted, due to the contradiction between the dark clouds in the image and "wonderful weather" in the text. (b) "Yep, totally normal <user>. Nothing is off about this. Nothing at all. #itstoohotalready #climatechangeisreal": the image affirms the sarcastic nature of the tweet by showing that the weather is actually very "hot" and not at all "totally normal".

2 Related Works

2.1 Sarcasm Detection

Various methods have been proposed for sarcasm detection from texts. Earlier methods extract carefully engineered discrete features from texts (Davidov et al., 2010; Riloff et al., 2013; Ptáček et al., 2014; Bouazizi and Ohtsuki, 2015), including n-grams, word sentiment, punctuation, emoticons, part-of-speech tags, etc. More recently, researchers have leveraged the powerful techniques of deep learning to obtain more precise semantic representations of tweet texts. Ghosh and Veale (2016) propose a model with CNN and RNN layers. Besides the tweet content in question, contextual features such as the historical behaviors of the author and the audience serve as good indicators of sarcasm. Bamman and Smith (2015) make use of human-engineered author, audience and response features to promote sarcasm detection. Zhang, Zhang and Fu (2016) concatenate target tweet embeddings (obtained by a Bi-GRU model) with manually engineered contextual features, and show fair improvement compared to completely feature-based systems. Amir et al. (2016) exploit trainable user embeddings to enhance the performance of a CNN classification model. Poria et al. (2016) use the concatenated output of CNNs trained on tweets and pre-trained on emotion, sentiment and personality as the input of a final SVM classifier. Tay et al. (2018) come up with a novel multi-dimensional intra-attention mechanism to explicitly model contrast and incongruity. Wu et al. (2018) construct a multi-task model with densely connected LSTMs based on embeddings, sentiment features and syntactic features. Baziotis et al. (2018) ensemble a word-based bidirectional LSTM and a character-based bidirectional LSTM to capture both semantic and syntactic features.

However, little has been revealed so far on how to effectively combine textual and visual information to boost the performance of Twitter sarcasm detection. Schifanella et al. (2016) simply concatenate manually designed features or deep-learning-based features of texts and images to make predictions with two modalities. Different from this work, we propose a hierarchical fusion model to deeply fuse three modalities.

2.2 Other Multi-Modal Tasks

Sentiment analysis is a task related to sarcasm detection. Much research on multi-modal sentiment analysis deals with video data (Wang et al., 2016; Zadeh et al., 2017), where text, image and audio data can usually be aligned and support each other. Though the inputs are different, their fusion mechanisms can be inspiring for our task. Poria, Cambria, and Gelbukh (2015) use multiple kernel learning to fuse different modalities. Zadeh et al. (2017) build their fusion layer with the outer product instead of simple concatenation in order to get more features. Gu et al. (2018b) align text and audio at the word level and apply several attention mechanisms. Gu et al. (2018a) first introduce a modality fusion structure attempting to reveal the actual importance of multiple modalities, but their methods are quite different from our hierarchical fusion techniques.

Inspiration can also be drawn from other multi-modal tasks, such as visual question answering (VQA), where an image and a query sentence are provided as model inputs. A question-guided attention mechanism is proposed for VQA tasks (Chen et al., 2015) and can boost model performance compared to using global image features. An attribute prediction layer is introduced by Wu et al. (2016) as a way to incorporate high-level concepts into the CNN-LSTM framework. Wang et al. (2017) exploit a handful of off-the-shelf algorithms, gluing them together with a co-attention model, and achieve generalizability as well as scalability. Yang et al. (2014) address image emotion extraction with image comments and propose a model that bridges images and comment information by learning Bernoulli parameters.

3 Proposed Hierarchical Fusion Model

Figure 2 shows the architecture of our proposed hierarchical fusion model. In this work, we treat text, image and image attributes as three modalities. The image attribute modality has been shown to boost model performance by adding high-level concepts of the image content (Wu et al., 2016). Modality fusion techniques are proposed to make full use of the three modalities. In the following paragraphs, we first define raw vectors and guidance vectors, and then briefly introduce our hierarchical fusion techniques.

For the image modality, we use a pre-trained and fine-tuned ResNet model to obtain 14 × 14 regional vectors of the tweet image, which are defined as the raw image vectors, and average them to get our image guidance vector. For the (image) attribute modality, we use another pre-trained and fine-tuned ResNet model to predict five attributes for each image, whose GloVe embeddings are taken as the raw attribute vectors. Our attribute guidance vector is a weighted average of the raw attribute vectors. We use a Bi-LSTM to obtain our text vectors: the raw text vectors are the concatenated forward and backward hidden states at each time step of the Bi-LSTM, while the text guidance vector is the average of these raw vectors. In the belief that the attached image can aid the model's understanding of the tweet text, we apply a non-linear transformation to the attribute guidance vector and feed the result to the Bi-LSTM as its initial hidden state; this process is named early fusion. In order to utilize multi-modal information to refine the representations of all modalities, representation fusion is proposed, in which the feature vectors of the three modalities are reconstructed using the raw vectors and guidance vectors. The refined vectors of the three modalities are then combined into one vector with a weighted average, instead of simple concatenation, in the process of modality fusion. Lastly, the fused vector is fed into a two-layer fully-connected neural network to obtain the classification result. More details of our model are provided below.

3.1 Image Feature Representation

We use ResNet-50 V2 (He et al., 2016) to obtain representations of tweet images. We chop off the last fully-connected (FC) layer of the pre-trained model and replace it with a new one for the sake of fine-tuning. Following Wang et al. (2017), an input image I is re-sized to 448 × 448 and divided into 14 × 14 regions. Each region I_i (i = 1, 2, ..., 196) is then sent through the ResNet model to obtain a regional feature representation v_i^{region}, a.k.a. a raw image vector:

v_i^{region} = \mathrm{ResNet}(I_i)

As described before, the image guidance vector v^{image} is the average of all regional image vectors:

v^{image} = \frac{\sum_{i=1}^{N_r} v_i^{region}}{N_r}

where N_r is the number of regions and is 196 in this work.
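To make this step concrete, the following is a minimal PyTorch sketch, not the authors' released code. It assumes the 196 raw image vectors can be read off the final 14 × 14 convolutional feature map of a torchvision ResNet-50 applied to the re-sized 448 × 448 image (the paper uses ResNet-50 V2 with a replaced FC layer; the backbone and weights here are illustrative stand-ins):

    import torch
    import torchvision.models as models

    # Illustrative backbone: torchvision's ResNet-50 with the average-pooling
    # and FC layers removed, so the output is a 14 x 14 spatial feature map.
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    image = torch.randn(1, 3, 448, 448)          # a re-sized tweet image
    with torch.no_grad():
        fmap = backbone(image)                   # (1, 2048, 14, 14)
    raw_image = fmap.flatten(2).transpose(1, 2)  # (1, 196, 2048): v_i^{region}
    v_image = raw_image.mean(dim=1)              # (1, 2048): image guidance vector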


Figure 2: Overview of our proposed model. (The diagram traces an input tweet, "YUM hospital cafeteria food – at Lake Charles Memorial Hospital", through the three modalities: regional image vectors for the image modality, GloVe embeddings of the predicted attributes "fork knife seat white meat" for the attribute modality, and a Bi-LSTM over the text; early fusion, representation fusion and modality fusion then produce the fused vector used for classification.)

3.2 Attribute Feature Representation

Previous work (Wu et al., 2016) in image captioning and visual question answering introduces attributes as high-level concepts of images. In their work, single-label and multi-label losses are proposed to train the attribute prediction CNN, whose parameters are transferred to generate the final image representation. While they use parameter sharing with attribute-labeling tasks to obtain better image representations, we take a more explicit approach: we treat attributes as an extra modality bridging the tweet text and image, by directly using the word embeddings of the five predicted attributes of each tweet image as the raw attribute vectors.

We first train an attribute predictor with ResNet-101 on the COCO image captioning dataset (Lin et al., 2014). We build the multi-label dataset by extracting 1000 attributes from sentences of the COCO dataset. We use a ResNet model pre-trained on ImageNet (Russakovsky et al., 2015) and fine-tune it on this multi-label dataset. The attribute predictor is then used to predict five attributes a_i (i = 1, ..., 5) for each image.
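As an illustration of the prediction step, here is a hedged sketch, assuming a torchvision ResNet-101 whose final layer is replaced with a 1000-way multi-label head; the fine-tuning loop is omitted and the attribute vocabulary `vocab` is a hypothetical list of the 1000 COCO-derived attribute words:

    import torch
    import torchvision.models as models

    NUM_ATTRS = 1000  # attribute vocabulary extracted from COCO sentences
    predictor = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    predictor.fc = torch.nn.Linear(predictor.fc.in_features, NUM_ATTRS)
    # ... fine-tune `predictor` on the multi-label COCO-derived dataset ...

    def top5_attributes(image, vocab):
        """Return the five most probable attribute words for one image."""
        predictor.eval()
        with torch.no_grad():
            probs = torch.sigmoid(predictor(image.unsqueeze(0)))  # independent labels
        indices = probs.squeeze(0).topk(5).indices
        return [vocab[int(i)] for i in indices]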

We generate the attribute guidance vector by weighted average. The raw attribute vectors e(a_i) are passed through a two-layer neural network to obtain the attention weights α_i for constructing the attribute guidance vector v^{attr}. The related equations are as follows:

\alpha_i = W_2 \cdot \tanh(W_1 \cdot e(a_i) + b_1) + b_2
\alpha = \mathrm{softmax}(\alpha)
v^{attr} = \sum_{i=1}^{N_a} \alpha_i e(a_i)

where a_i is the i-th image attribute, literally a word out of a vocabulary of 1000; e is the GloVe embedding operation; W_1 and W_2 are weight matrices; b_1 and b_2 are biases; N_a is the number of attributes, and is 5 in our settings.

3.3 Text Feature Representation

A bidirectional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) is used to obtain the representation of the tweet text. The operations performed by an LSTM at time step t are as follows:

i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1})
f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1})
o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1})
\tilde{c}_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1})
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where W_i, W_f, W_o, W_c, U_i, U_f, U_o, U_c are weight matrices; x_t and h_t are the input and hidden state at time step t, respectively; σ is the sigmoid function; ⊙ denotes the element-wise product. The text guidance vector is the arithmetic average of the hidden states over all time steps:

v^{text} = \frac{\sum_{t=1}^{L} h_t}{L}

where L is the length of the tweet text.
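A minimal sketch of the text branch, assuming PyTorch's nn.LSTM with a hidden size of 256 per direction (Table 2); the optional initial state anticipates the early fusion described in the next subsection:

    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Bi-LSTM over word embeddings: per-step hidden states are the raw
        text vectors; their average is the text guidance vector v^{text}."""
        def __init__(self, emb_dim=200, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden,
                                batch_first=True, bidirectional=True)

        def forward(self, word_emb, init_state=None):  # (batch, L, emb_dim)
            raw, _ = self.lstm(word_emb, init_state)   # (batch, L, 2*hidden)
            return raw, raw.mean(dim=1)                # raw vectors, v^{text}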

3.4 Early Fusion

The Bi-LSTM initial states are usually set to zeros in text classification tasks, but they are a potential spot where multi-modal information can be infused to promote the model's comprehension of the text. In the proposed model, we use the non-linearly transformed attribute guidance vector as the initial state of the Bi-LSTM:

[h_{f0}; h_{b0}; c_{f0}; c_{b0}] = \mathrm{ReLU}(W \cdot v^{attr} + b)

where h_{f0}, c_{f0} are the forward LSTM initial states and h_{b0}, c_{b0} are the backward LSTM initial states; [ ; ] denotes vector concatenation; ReLU denotes element-wise application of the rectified linear unit activation function; W and b are a weight matrix and a bias.
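A sketch of this transformation, under the same assumptions as the text-encoder sketch above:

    import torch
    import torch.nn as nn

    class EarlyFusion(nn.Module):
        """Map the attribute guidance vector to Bi-LSTM initial states."""
        def __init__(self, attr_dim=200, hidden=256):
            super().__init__()
            self.proj = nn.Linear(attr_dim, 4 * hidden)  # W, b

        def forward(self, v_attr):                       # (batch, attr_dim)
            states = torch.relu(self.proj(v_attr))       # [h_f0; h_b0; c_f0; c_b0]
            h_f0, h_b0, c_f0, c_b0 = states.chunk(4, dim=-1)
            # nn.LSTM expects initial states shaped (num_directions, batch, hidden)
            h0 = torch.stack([h_f0, h_b0], dim=0).contiguous()
            c0 = torch.stack([c_f0, c_b0], dim=0).contiguous()
            return (h0, c0)

Chaining the two sketches, TextEncoder()(word_emb, EarlyFusion()(v_attr)) realizes early fusion.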

We also tried using the image guidance vector for early fusion, obtaining the LSTM initial states in a way similar to the one described above, but it does not perform as well, as will be discussed in the experiments.

3.5 Representation Fusion

Inspired by the attention mechanisms in VQA tasks, representation fusion aims at reconstructing the feature vectors v^{image}, v^{text} and v^{attr} with the help of the low-level raw vectors (namely, the hidden states {h_t} for the text modality, the 196 regional vectors for the image modality, and the five attribute embeddings for the attribute modality) and the high-level guidance vectors of the different modalities.

We denote X_m^{(i)} as the i-th raw vector from modality m (which may be text, image or attribute). The key step in this stage is to calculate the weight for each X_m^{(i)}; the weighted average then becomes the new representation of modality m.

To leverage as much information as possible and to model the relationships between modalities more accurately, we exploit information from all three modalities, namely the guidance vectors v_n where n can be text, image or attribute, when calculating the weights of the raw vectors in each modality. For the i-th raw vector of each modality m, we calculate three guided weights α_mn^{(i)} from the guidance vectors of the different modalities n. The final reconstruction weight for the raw vector is the average of the normalized guided weights:

\alpha_{mn}^{(i)} = W_{mn_2} \cdot \tanh(W_{mn_1} \cdot [X_m^{(i)}; v_n] + b_{mn_1}) + b_{mn_2}
\alpha_{mn} = \mathrm{softmax}(\alpha_{mn})
\alpha_m^{(i)} = \frac{1}{3} \sum_{n \in \{text,\, image,\, attr\}} \alpha_{mn}^{(i)}
v_m = \sum_{i=1}^{L_m} \alpha_m^{(i)} X_m^{(i)}

where m, n ∈ {text, image, attr} denote modalities; α_mn^{(i)} is the guided weight for the i-th raw vector of modality m under the guidance of modality n, and α_mn contains the α_mn^{(i)} of all raw vectors of modality m under the guidance of modality n; α_m^{(i)} is the final reconstruction weight for the i-th raw vector of modality m; L_m is the length of the sequence {X_m^{(i)}}; W_{mn_1}, W_{mn_2} are weight matrices and b_{mn_1}, b_{mn_2} are biases.

After representation fusion, v^{image}, v^{text} and v^{attr}, which previously denoted the guidance vectors, now denote the reconstructed feature vectors of each modality, ready to serve as inputs to the next layer.
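The following hedged sketch mirrors these equations for one modality m; `guide_dims` holds the dimensions of the three guidance vectors, and the two-layer scorers play the roles of W_{mn_1}, W_{mn_2} and their biases:

    import torch
    import torch.nn as nn

    class RepresentationFusion(nn.Module):
        """Reconstruct one modality from its raw vectors, with reconstruction
        weights averaged over the guidance of all three modalities."""
        def __init__(self, raw_dim, guide_dims, hidden=256):
            super().__init__()
            # one two-layer scorer (W_mn1/b_mn1, W_mn2/b_mn2) per guiding modality n
            self.scorers = nn.ModuleList([
                nn.Sequential(nn.Linear(raw_dim + g, hidden), nn.Tanh(),
                              nn.Linear(hidden, 1))
                for g in guide_dims])

        def forward(self, raw, guides):  # raw: (batch, L_m, raw_dim)
            weights = []
            for scorer, v_n in zip(self.scorers, guides):  # v_n: (batch, g)
                v = v_n.unsqueeze(1).expand(-1, raw.size(1), -1)
                score = scorer(torch.cat([raw, v], dim=-1))  # (batch, L_m, 1)
                weights.append(torch.softmax(score, dim=1))  # normalize over i
            alpha = torch.stack(weights).mean(dim=0)         # average guided weights
            return (alpha * raw).sum(dim=1)                  # reconstructed v_m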

3.6 Modality Fusion

Instead of simply concatenating the feature vectors of the different modalities into one longer vector, we perform modality fusion, motivated by the work of Gu et al. (2018a). The feature vector of each modality m, denoted v_m, is first transformed into a fixed-length form v'_m. A two-layer feed-forward neural network calculates the attention weight for each modality m, which is then used in a weighted average of the transformed feature vectors v'_m. The result is a single fixed-length vector v^{fused}:

\alpha_m = W_{m_2} \cdot \tanh(W_{m_1} \cdot v_m + b_{m_1}) + b_{m_2}
\alpha = \mathrm{softmax}(\alpha)
v'_m = \tanh(W_{m_3} \cdot v_m + b_{m_3})
v^{fused} = \sum_{m \in \{text,\, image,\, attr\}} \alpha_m v'_m

where m is one of the three modalities and α is a vector containing the α_m; W_{m_1}, W_{m_2}, W_{m_3} are weight matrices and b_{m_1}, b_{m_2}, b_{m_3} are biases; v_m denotes the reconstructed feature vector from the representation fusion process.
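A sketch of modality fusion under the same conventions; the fused size of 512 follows Table 2, and each per-modality scorer and projection stands in for (W_{m_1}, W_{m_2}, b_{m_1}, b_{m_2}) and (W_{m_3}, b_{m_3}), respectively:

    import torch
    import torch.nn as nn

    class ModalityFusion(nn.Module):
        """Weighted average of the three reconstructed modality vectors."""
        def __init__(self, dims, fused_dim=512, hidden=256):
            super().__init__()
            self.score = nn.ModuleList([
                nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))
                for d in dims])
            self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])

        def forward(self, vs):  # list of three (batch, d_m) feature vectors
            scores = torch.cat([s(v) for s, v in zip(self.score, vs)], dim=1)
            alpha = torch.softmax(scores, dim=1)             # (batch, 3)
            v_prime = torch.stack(
                [torch.tanh(p(v)) for p, v in zip(self.proj, vs)], dim=1)
            return (alpha.unsqueeze(-1) * v_prime).sum(dim=1)  # v^{fused}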


            Training   Development   Test
sentences   19816      2410          2409
positive    8642       959           959
negative    11174      1451          1450

Table 1: Statistics of our dataset.


3.7 Classification Layer

We use a two-layer fully-connected neural network as our classification layer. The activation functions of the hidden layer and the output layer are element-wise ReLU and sigmoid, respectively. The loss function is cross entropy.
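A minimal sketch of this head; the hidden size of 256 is an assumption following the half-of-input-size rule stated in Section 5.1:

    import torch.nn as nn

    class Classifier(nn.Module):
        """Two-layer fully-connected head: ReLU hidden layer, sigmoid output."""
        def __init__(self, fused_dim=512, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(fused_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, v_fused):
            return self.net(v_fused).squeeze(-1)  # P(sarcastic)

    loss_fn = nn.BCELoss()  # binary cross entropy over the sigmoid output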

4 Dataset and Preprocessing

There is no publicly available dataset for evaluating the multi-modal sarcasm detection task, and thus we build our own, which will be released later. We collect and preprocess our data similarly to Schifanella et al. (2016). We collect English tweets containing a picture and certain special hashtags (e.g., #sarcasm) as positive examples (i.e., sarcastic), and collect English tweets with images but without such hashtags as negative examples (i.e., not sarcastic). We further clean up the data as follows. First, we discard tweets containing sarcasm, sarcastic, irony or ironic as regular words. We also discard tweets containing URLs in order to avoid introducing additional information. Furthermore, we discard tweets with words that frequently co-occur with sarcastic tweets and thus may express sarcasm, for instance jokes, humor and exgag. We divide the data into a training set, a development set and a test set with a ratio of 80%:10%:10%. In order to evaluate models more accurately, we manually check the development and test sets to ensure the accuracy of the labels. The statistics of our final dataset are listed in Table 1.

Hyper-parameter                      Value
LSTM hidden size                     256
Batch size                           32
Learning rate                        0.001
Gradient clipping                    5
Early stop patience                  5
Word and attribute embedding size    200
ResNet FC size                       1024
Modality fusion size                 512
LSTM dropout rate                    0.2
Classification layer L2 parameter    1e-7

Table 2: Hyper-parameters.

For preprocessing, we first replace mentions with a special symbol 〈user〉. We then separate words, emoticons and hashtags with the NLTK toolkit. We also separate the hashtag sign # from hashtags and replace capital letters with their lowercase forms. Finally, words appearing only once in the training set, as well as words not appearing in the training set but appearing in the development or test set, are replaced with a special symbol 〈unk〉.
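A hedged sketch of this pipeline; the paper only says the NLTK toolkit is used, so the choice of TweetTokenizer here is an assumption, and `vocab` is a hypothetical set of in-vocabulary training words:

    import re
    from nltk.tokenize import TweetTokenizer

    tokenizer = TweetTokenizer()

    def preprocess(tweet, vocab):
        """vocab is assumed to hold training-set words seen more than once."""
        tweet = re.sub(r"@\w+", "<user>", tweet)       # mentions -> <user>
        tokens = []
        for tok in tokenizer.tokenize(tweet.lower()):
            if tok.startswith("#") and len(tok) > 1:   # split off the hashtag sign
                tokens.extend(["#", tok[1:]])
            else:
                tokens.append(tok)
        return [t if t in vocab else "<unk>" for t in tokens]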

5 Experiments

5.1 Training Details

Pre-trained models. The pre-trained ResNet model is available online. The word embeddings and attribute embeddings are trained on the Twitter dataset using GloVe (Pennington et al., 2014).

Fine-tuning. The parameters of the pre-trained ResNet model are fixed during training. The parameters of the word and attribute embeddings are updated during training.

Optimization. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the loss function.

Hyper-parameters. The hidden layer size of each neural network described in the fusion techniques is half of its input size. Other hyper-parameters are listed in Table 2.

5.2 Comparison Results

Table 3 shows the comparison results (F-score and accuracy) of the baseline models and our proposed model. We implement models with one or multiple modalities as baseline models. We also present the results of naïve solutions to this task (all negative, random).

Random. Randomly predicts whether a tweet is sarcastic or not.

Text(Bi-LSTM). Bi-LSTM is one of the most popular methods for text classification problems. It leverages a bidirectional LSTM network to learn text representations and then uses a classification layer to make predictions.

Text(CNN). CNNs are also among the state-of-the-art methods for text classification problems. We implement the text CNN of Kim (2014) as a baseline model.


Model           F-score   Pre      Rec      Acc
All negative    -         -        -        0.6019
Random          0.4470    0.4005   0.5057   0.5027
Text(Bi-LSTM)   0.7753    0.7666   0.7842   0.8190
Text(CNN)       0.7532    0.7429   0.7639   0.8003
Image           0.6153    0.5441   0.7080   0.6476
Attr            0.6334    0.5606   0.7278   0.6646
Concat(2)       0.7799    0.7388   0.8259   0.8103
Concat(3)       0.7874    0.7336   0.8498   0.8174
Our model       0.8018    0.7657   0.8415   0.8344

Table 3: Comparison results.

Image. Image vectors after the pooling layer of ResNet are the inputs of the classification layer. We only update the parameters of the classification layer.

Attr. Since image attributes form one of the modalities in our proposed model, we also try using only attribute features to make predictions. The attribute feature vectors are the inputs of the classification layer.

Concat. Previous work (Schifanella et al., 2016) concatenates feature vectors of different modalities as the input of the classification layer. We implement this concatenation model with our feature vectors of the different modalities and apply it for classification. The number in parentheses is the number of modalities used: (2) means concatenating text and image features, while (3) means concatenating all text, image and attribute features.

We can see that the models based only on the image or attribute modality do not perform well, while the Text(Bi-LSTM) and Text(CNN) models perform much better, indicating the important role of the text modality. The Concat(3) model outperforms Concat(2), because adding attributes as a new modality introduces external semantic information about the images and helps the model when it fails to extract valid image features. Our proposed hierarchical fusion model further improves the performance and achieves the state-of-the-art scores, revealing that our fusion model leverages the features of the three modalities in a more effective way.

We further apply sign tests between our proposed model and the Text(Bi-LSTM), Concat(2) and Concat(3) models. The null hypotheses are that our proposed model does not perform better than each baseline model. The statistics of the sign tests are listed in Table 4. All significance levels are less than 0.05; therefore, all of the null hypotheses are rejected and our proposed model performs significantly better than the baseline models.

     Concat(3)   Concat(2)   Text(Bi-LSTM)
t+   106         149         120
t−   65          91          83
p    0.0011      0.0001      0.0057

Table 4: Statistics of the sign tests. (t+ is the number of tweets that our proposed model predicts correctly but the baseline model does not; t− is the number of tweets that the baseline model predicts correctly but our proposed model does not; p is the significance value.)
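The reported p-values are consistent with a one-sided binomial sign test over the disagreement counts; the following SciPy sketch (our reconstruction, not necessarily the authors' exact procedure) reproduces the Concat(3) column:

    from scipy.stats import binomtest

    # One-sided sign test for "Our model" vs. Concat(3), using the counts
    # from Table 4: under the null, each disagreement is a fair coin flip.
    t_plus, t_minus = 106, 65
    result = binomtest(t_plus, t_plus + t_minus, p=0.5, alternative="greater")
    print(result.pvalue)  # ~0.0011, matching the table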

5.3 Component Analysis of Our Model

We further evaluate the influence of early fusion, representation fusion, and the choice of modality representation in early fusion on the final performance. The evaluation results are listed in Table 5.

Model       F-score   Pre      Rec      Acc
w/o EF      0.7880    0.7570   0.8217   0.8240
w/o RF      0.7902    0.7456   0.8405   0.8223
EF(img)     0.7787    0.7099   0.8624   0.8049
Our model   0.8018    0.7657   0.8415   0.8344

Table 5: Ablation study. "w/o" denotes removal of the component; EF denotes early fusion; RF denotes representation fusion; EF(img) means using the image guidance vector for early fusion.

We can see that the removal of early fusion decreases the performance, which shows that early fusion improves the text representation. Early fusion with the attribute representation performs better than with the image representation, indicating the gap between the text and image representations. If representation fusion is removed, the performance also decreases, which indicates that representation fusion is necessary and that it refines the feature representation of each modality.

6 Visualization Analysis

6.1 Running Examples

Figure 3 shows some sarcastic examples that our proposed model predicts correctly while the model with only the text modality fails to label them right. This shows that, with our model, images and attributes can contribute to sarcasm detection. For example, an image of a dangerous tackle together with a text claiming that it is "not dangerous" conveys strong sarcasm in example (a).

Figure 3: Examples of sarcastic tweets. (a) "this isn 't dangerous . going to teach my players to tackle like this at practice first thing tomorrow morning ." (attributes: field, players, playing, soccer, men); (b) "i love respectful customers" (attributes: pile, stack, messy, sitting, boxes); (c) "<user> your counselor is so cute . glad you 're staffing up so well ." (attributes: teddy, bear, wearing, hat, brown).

Figure 4: Attention visualization of sarcastic tweets. (a) "Weather 's lookin amazing today (unamused face emoji) ..." (attributes: houses, street, sitting, trees, near); (b) "happy testing monday ! eat a good breakfast # serious" (attributes: cloth, sitting, colored, table, meat); (c) "yum hospital cafeteria food – at lake charles memorial hospital" (attributes: fork, knife, sitting, white, meat).

"Respectful customers" is contradicted by the messy parcels, as well as by the attribute "messy", in example (b). Without the images, successfully detecting these instances of sarcasm would be almost impossible. The model with only the text modality also fails to detect the sarcasm in example (c), even though the word "so" is repeated several times there. With the image and attribute modalities, however, our proposed model correctly detects the sarcasm in these tweets.

6.2 Attention Visualization

Figure 4 shows the attention weights of some examples at the representation fusion stage. Our model can successfully focus on the appropriate parts of the image, the essential words in the sentence and the important attributes. In example (a), our model pays more attention to the unamused face emoji and the word "amazing" in the text, and to the gloomy sky in the image; the tweet is therefore predicted to be sarcastic because of the inconsistency between the two modalities. In example (b), our model focuses on the word "serious" in the text and on the simple meal in the picture that contradicts the "good breakfast", revealing that the tweet is sarcastic. In example (c), the word "yum", the attribute "meat" and the food in the image indicate the sarcastic meaning of the tweet.

6.3 Error Analysis

Figure 5 shows an example that our model fails to label correctly.

Figure 5: Example of a misclassified sample: "yo <user> thanks for the yearly fee reminder! Here's to you! #planetfitness #hiddenfee #mrmet" (attributes: ball, holding, shoes, little, white).

In this example, the insulting gesture in the picture contrasts with the phrase "thanks for". However, the model lacks the common-sense knowledge that this gesture is insulting, so the attention over the picture does not focus on the gesture. Moreover, the attributes do not reveal the insulting meaning of the picture either, and thus our model fails to predict this tweet as sarcastic.

7 Conclusion and Future Work

In this paper we propose a new hierarchical fusion model that makes full use of three modalities (images, texts and image attributes) to address the challenging multi-modal sarcasm detection task. Evaluation results demonstrate the effectiveness of our proposed model and the usefulness of the three modalities. In future work, we will incorporate other modalities, such as audio, into the sarcasm detection task, and we will also investigate making use of common sense knowledge in our model.


Acknowledgment

This work was supported by the National Natural Science Foundation of China (61772036) and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.

References

Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mario J. Silva. 2016. Modelling context with user embeddings for sarcasm detection in social media. CoRR, abs/1607.00976.

David Bamman and Noah A. Smith. 2015. Contextualized sarcasm detection on Twitter. In ICWSM.

Christos Baziotis, Nikos Athanasiou, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 task 3: Tracking ironic tweets using ensembles of word and character level attentive RNNs. arXiv preprint arXiv:1804.06659.

M. Bouazizi and T. Ohtsuki. 2015. Sarcasm detection in Twitter: "All your products are incredibly amazing!!!" - are they really? In 2015 IEEE Global Communications Conference (GLOBECOM), pages 1–6.

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 107–116, Stroudsburg, PA, USA. Association for Computational Linguistics.

Shelly Dews and Ellen Winner. 1995. Muting the meaning: A social function of irony. Metaphor and Symbol, 10(1):3–19.

Aniruddha Ghosh and Tony Veale. 2016. Fracking sarcasm using neural network. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 161–169. Association for Computational Linguistics.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018a. Hybrid attention based multimodal network for spoken language classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2379–2390. Association for Computational Linguistics.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018b. Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2225–2235. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. CoRR, abs/1603.05027.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. 2016. A deeper look into sarcastic tweets using deep convolutional neural networks. CoRR, abs/1610.08815.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213–223. Dublin City University and Association for Computational Linguistics.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 704–714. Association for Computational Linguistics (ACL).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Rossano Schifanella, Paloma de Juan, Joel R. Tetreault, and Liangliang Cao. 2016. Detecting sarcasm in multimodal social platforms. CoRR, abs/1608.02289.

Yi Tay, Luu Anh Tuan, Siu Cheung Hui, and Jian Su. 2018. Reasoning with sarcasm by reading in-between. CoRR, abs/1805.02856.

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P. Xing. 2016. Select-additive learning: Improving cross-individual generalization in multimodal sentiment analysis. CoRR, abs/1609.05244.

Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. 2017. The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 4.

Chuhan Wu, Fangzhao Wu, Sixing Wu, Junxin Liu, Zhigang Yuan, and Yongfeng Huang. 2018. THU_NGN at SemEval-2018 task 3: Tweet irony detection with densely connected LSTM and multi-task learning. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 51–56.

Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203–212.

Yang Yang, Jia Jia, Shumei Zhang, Boya Wu, Qicong Chen, Juanzi Li, Chunxiao Xing, and Jie Tang. 2014. How do your friends on social media disclose your emotions? In Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 306–312.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. CoRR, abs/1707.07250.

Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Tweet sarcasm detection using deep neural network. In COLING.