
Sarcasm Detection using Hybrid Neural Network

Rishabh Misra
A53205530
University of California San Diego
[email protected]

Prahal Arora
A53219500
University of California San Diego
[email protected]

    Abstract

Sarcasm detection has enjoyed great interest from the research community; however, the task of predicting sarcasm in a text remains an elusive problem for machines. Past studies mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. To overcome these shortcomings, we introduce a new dataset which contains news headlines from a sarcastic news website and a real news website. Next, we propose a hybrid neural network architecture with an attention mechanism which provides insights about what actually makes sentences sarcastic. Through experiments, we show that the proposed model improves upon the baseline by ∼5% in terms of classification accuracy.

    1 Limitations of previous work

(Amir et al., 2016) propose to use a CNN to automatically extract relevant features from tweets and augment them with user embeddings to provide more contextual features during sarcasm detection. However, this work is limited in the following aspects:

• The Twitter dataset used in the study was collected using hashtag-based supervision. As per various studies (Liebrecht et al., 2013; Joshi et al., 2017), such datasets have noisy labels. Furthermore, people use very informal language on Twitter, which introduces sparsity in the vocabulary, and for many words pre-trained embeddings are not available. Lastly, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of the contextual tweets.

• The proposed modeling is quite simplistic. The authors use a CNN with one convolutional layer to extract relevant features from text, which are then concatenated with (pre-trained) user embeddings to produce the final classification score. However, some studies like (Yin et al., 2017) show that RNNs are more suitable for sequential data. Furthermore, the authors propose a separate method to learn the user embeddings, which means the model is not trainable end to end.

• The authors do not provide any qualitative analysis from the model to show where the model performs well and where it does not.

• Upon analysis, we understand that detecting sarcasm requires common sense knowledge, without which the model might not actually understand what sarcasm is and may just pick up some discriminative lexical cues. This direction has not been addressed in previous studies, to the best of our knowledge.

In section 2, we describe the dataset we collected to overcome the limitations of Twitter datasets. In section 3, we describe the network architecture of the proposed model. In sections 4 and 5, we provide experiment details, results and analysis. To conclude, we provide a few future directions in section 6.

    2 Dataset

To overcome the limitations related to noise in Twitter datasets, we collected a new Headlines dataset1 from two news websites. TheOnion2 aims at producing sarcastic versions of current events

1 https://github.com/rishabhmisra/Headlines-Dataset-For-Sarcasm-Detection

2 https://www.theonion.com/

arXiv:1908.07414v1 [cs.LG] 20 Aug 2019


Figure 1: Word clouds of sarcastic and non-sarcastic headlines, respectively.

and we collected all the headlines from the News in Brief and News in Photos categories (which are sarcastic). We collect real (and non-sarcastic) news headlines from HuffPost3. As a basic exploration, we visualize the word clouds in Figure 1, through which we can see the types of words that occur frequently in each category. The general statistics of this dataset, along with the dataset provided by the Semeval challenge4, are given in Table 1. We can notice that for the Headlines dataset, where the text is much more formal in language, the percentage of words not available in the word2vec vocabulary is much lower than for the Semeval dataset.

Statistic/Dataset                   Headlines   Semeval
# Records                           26,709      3,000
# Sarcastic records                 11,725      2,396
# Non-sarcastic records             14,984      604
% word embeddings not available     23.35       35.53

Table 1: General statistics of datasets.
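The last row of Table 1 can be reproduced with a short sketch. This is an illustrative simplification: it assumes a plain whitespace tokenizer and represents the word2vec vocabulary as a Python set, neither of which is specified by the paper.

```python
# Sketch: computing the "% word embeddings not available" statistic from
# Table 1. Assumes a whitespace tokenizer and a set of words that have
# pre-trained word2vec vectors (both are simplifying assumptions).

def oov_percentage(headlines, embedding_vocab):
    """Percentage of distinct words with no pre-trained embedding."""
    words = {w for h in headlines for w in h.lower().split()}
    missing = sum(1 for w in words if w not in embedding_vocab)
    return 100.0 * missing / len(words)

# Toy example with a hypothetical four-word vocabulary:
vocab = {"man", "stopped", "paying", "attention"}
headlines = ["man stopped paying attention", "insane k-pop sh*t"]
print(round(oov_percentage(headlines, vocab), 2))  # 3 of 7 distinct words missing
```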

This new dataset has the following advantages over the existing Twitter datasets:

• Since news headlines are written by professionals in a formal manner, there are no spelling mistakes or informal usages. This reduces sparsity and also increases the chance of finding pre-trained embeddings.

• Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise compared to Twitter datasets.

• Unlike tweets which are replies to other tweets, the news headlines we obtained are self-contained. This would help us in teasing apart the real sarcastic elements.

3 https://www.huffingtonpost.com/
4 https://competitions.codalab.org/competitions/17468
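For reference, the dataset linked above is distributed as one JSON record per line; a minimal loading sketch follows. The field names `headline` and `is_sarcastic` are taken from the public GitHub release and should be checked against the actual files.

```python
import json

# Sketch: parsing the Headlines dataset. Assumes one JSON record per line
# with fields `headline` and `is_sarcastic` (per the public release).

def load_headlines(lines):
    """Parse JSON-lines records into (headline, label) pairs."""
    pairs = []
    for line in lines:
        rec = json.loads(line)
        pairs.append((rec["headline"], int(rec["is_sarcastic"])))
    return pairs

sample = [
    '{"headline": "majority of nations civic engagement centered around oppressing other people", "is_sarcastic": 1}',
    '{"headline": "courtroom sketch artist has clear manga influences", "is_sarcastic": 0}',
]
for headline, label in load_headlines(sample):
    print(label, headline)
```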

    3 Network Architecture

The original architecture of (Amir et al., 2016) takes pre-trained user embeddings (context) and tweets (content) as input and outputs a binary value for sarcasm detection. We tweaked this architecture to remove the user-context modeling path, since sarcasm in this dataset does not depend on authors but rather on current events and common knowledge. In addition, a new LSTM module is added to encode the left (and right) context of the words in a sentence at every time step. This LSTM module is supplemented with an attention module to reweigh the encoded context at every time step.

We hypothesize that the sequential information encoded in the LSTM module would complement the existing CNN module in the original architecture of (Amir et al., 2016), which captures regular n-gram word patterns throughout the entire length of the sentence. We also hypothesize that the attention module can really benefit the task at hand: it can selectively emphasize incongruent co-occurring word phrases (words with contrasting implied sentiment). For example, in the sentence “majority of nations civic engagement centered around oppressing other people”, our attentive model can emphasize the co-occurrence of ‘civic engagement’ and ‘oppressing other people’ to classify this sentence as sarcastic. The detailed architecture of our model is illustrated in Figure 2.

The LSTM module with attention is similar to the one used to jointly align and translate in a neural machine translation task (Bahdanau et al., 2014). A BiLSTM consists of forward and backward LSTMs. The forward LSTM calculates a sequence of forward hidden states, and the backward LSTM reads the sequence in reverse order to calculate backward hidden states. We obtain an annotation for each word in the input sentence by concatenating its forward and backward hidden states. In this way, the annotation hj contains summaries of both the preceding and the following words. Due to the tendency of LSTMs to better represent recent inputs, the annotation at any time step is focused on the words around that time step: each hidden state summarizes the whole input sequence with a strong focus on the parts surrounding the corresponding input word.

Figure 2: Hybrid Network Architecture

The context vector c is then computed as a weighted sum of these annotations.

c = ∑_{i=1}^{N} αi hi

Here, αi is the weight/attention of hidden state hi, calculated by computing a softmax over the scores of the hidden states. The score of each individual hi is obtained by forwarding hi through a multi-layer perceptron that outputs a scalar score.
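The attention step described above can be sketched in a few lines of NumPy. The one-hidden-layer tanh scorer and all weight shapes here are illustrative choices, not the paper's actual parameters; the point is the score → softmax → weighted-sum pipeline.

```python
import numpy as np

# Minimal attention sketch: each annotation h_i is scored by a small MLP,
# scores are normalized with a softmax into weights alpha_i, and the
# context vector is the weighted sum  c = sum_i alpha_i * h_i.
# Scorer architecture and shapes are illustrative assumptions.

def attention_context(H, W1, b1, w2):
    """H: (N, d) annotations -> (context vector of shape (d,), weights)."""
    scores = np.tanh(H @ W1 + b1) @ w2             # (N,) unnormalized scores
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax weights
    return alpha @ H, alpha                        # weighted sum of annotations

rng = np.random.default_rng(0)
N, d, hidden = 5, 8, 4                             # 5 annotations of size 8
H = rng.normal(size=(N, d))
W1, b1, w2 = rng.normal(size=(d, hidden)), np.zeros(hidden), rng.normal(size=hidden)
c, alpha = attention_context(H, W1, b1, w2)
print(c.shape, round(float(alpha.sum()), 6))       # context shape (8,), weights sum to 1
```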

The context vector c is finally concatenated with the output of the CNN module. Together, this large feature vector is then fed to an MLP which outputs the binary probability distribution of the sentence being sarcastic/non-sarcastic.
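A sketch of this classification head, again with illustrative layer sizes rather than the paper's tuned hyper-parameters:

```python
import numpy as np

# Sketch of the classification head: the CNN feature vector and the
# attention context vector are concatenated and passed through an MLP
# whose softmax output gives P(non-sarcastic) and P(sarcastic).
# All dimensions below are illustrative assumptions.

def classify(cnn_feats, context, W1, b1, W2, b2):
    z = np.concatenate([cnn_feats, context])  # joint feature vector
    h = np.tanh(z @ W1 + b1)                  # hidden MLP layer
    logits = h @ W2 + b2                      # one logit per class
    e = np.exp(logits - logits.max())         # stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
cnn_feats, context = rng.normal(size=100), rng.normal(size=16)
W1, b1 = rng.normal(size=(116, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 2)), np.zeros(2)
probs = classify(cnn_feats, context, W1, b1, W2, b2)
print(probs.shape)  # two class probabilities summing to 1
```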

    4 Experiments

    4.1 Baseline

With the new dataset in hand, we tweak the model of (Amir et al., 2016) and consider it as a baseline. We remove the author embedding component because the sarcasm here is independent of authors (it is based on current events and common knowledge). The CNN module remains intact.

    4.2 Experimental Setup

To represent the words, we use pre-trained embeddings from the word2vec model and initialize the missing words uniformly at random in both models. These are then tuned during the training process. We create train, validation and test sets by splitting the data randomly in an 80:10:10 ratio. We tune hyper-parameters like the learning rate, regularization constant, output channels, filter width, hidden units and dropout fraction using grid search. The model is trained by minimizing the cross-entropy error between the predictions and the true labels; the gradients with respect to the network parameters are computed with backpropagation and the model weights are updated with the AdaDelta rule. Code for both methods is available on GitHub5.
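Two of these setup steps can be sketched concretely. The embedding dimension (300) and the uniform init range are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

# Sketch of two setup steps: (1) use the pre-trained word2vec vector when
# available, else initialize uniformly at random; (2) random 80:10:10 split.
# dim=300 and scale=0.25 are illustrative assumptions.

def build_embeddings(vocab, pretrained, dim=300, scale=0.25, seed=0):
    """Map each word to its pre-trained vector or a random uniform one."""
    rng = np.random.default_rng(seed)
    return {w: pretrained.get(w, rng.uniform(-scale, scale, size=dim))
            for w in vocab}

def train_val_test_split(n_examples, seed=0):
    """Shuffle indices and split 80:10:10."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train, n_val = int(0.8 * n_examples), int(0.1 * n_examples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = train_val_test_split(26709)  # dataset size from Table 1
print(len(train), len(val), len(test))          # 21367 2670 2672
```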

    5 Results and Analysis

    5.1 Quantitative Results

We report the quantitative results of the baseline and the proposed method in terms of classification accuracy, since the dataset is mostly balanced. The final classification accuracy after hyper-parameter tuning is provided in Table 2. As shown, our model improves upon the baseline by ∼5%, which

5 https://github.com/rishabhmisra/Sarcasm-Detection-using-CNN


supports our first hypothesis mentioned in section 3. The performance trend of our model is shown in Figure 3.

Implementation      Test Accuracy
Baseline            84.88%
Proposed method     89.7%

Table 2: Performance of the baseline and the proposed method in terms of classification accuracy.

Figure 3: Loss and accuracy trend of the proposed method.

5.2 Qualitative Results

We visualize the attention over some of the sarcastic sentences in the test set that are correctly classified with high confidence scores. This helps us check whether our hypotheses are correct and provides insights into the sarcasm detection process. Figure 4a and Figure 4b show that the attention module emphasizes the co-occurrence of incongruent word phrases within each sentence, such as ‘civic engagement’ & ‘oppressing other people’ in 4a and ‘excited for’ & ‘insane k-pop sh*t during opening ceremony’ in 4b. This incongruity is an important cue for humans too and supports our second hypothesis mentioned in section 3; it has been studied extensively in (Joshi et al., 2015). Figure 4c shows that the presence of ‘bald man’ indicates that the headline is insincere and probably meant to ridicule someone. Similarly, ‘stopped paying attention’ in Figure 4d is more likely to show up in a satirical sentence than in a sincere news headline.

Figure 4: Visualizing attention over the entire length of the sarcastic sentences (panels a–d).

    6 Future Work

Given the time crunch, we are left with several unexplored directions that we would like to work on in the future. Some of the important directions are as follows:

• We can do an ablation study on our proposed architecture to analyze the contribution of each module.

• The approach proposed in this work could be considered a pre-training step, and the learned parameters could be tuned further on the Semeval dataset. Our intuition behind this direction is that such pre-training would allow us to capture general cues for sarcasm which would be hard to learn on the Semeval dataset alone (given its small size). This type of transfer learning is shown to be effective when limited data is available (Pan and Yang, 2010).

• Lastly, we observe that the detection of sarcasm depends a lot on common knowledge (current events and common sense). Thus, we plan to integrate this knowledge into our network so that our model is able to detect sarcasm based on which sentences deviate from common knowledge. Recently, (Young et al., 2017) integrated such knowledge into dialogue systems, and the ideas mentioned there could be adapted to our setting as well.

Contributions from Team Members

We did pair programming for this assignment. Both team members contributed equally.

References

Silvio Amir, Byron C. Wallace, Hao Lyu, and Paula Carvalho Mário J. Silva. 2016. Modelling context with user embeddings for sarcasm detection in social media. arXiv preprint arXiv:1607.00976.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: A survey. ACM Computing Surveys (CSUR) 50(5):73.

Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. 2015. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 757–762.

Christine Liebrecht, Florian Kunneman, and Antal van den Bosch. 2013. The perfect solution for detecting sarcasm in tweets #not. In WASSA@NAACL-HLT.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.

Tom Young, Erik Cambria, Iti Chaturvedi, Minlie Huang, Hao Zhou, and Subham Biswas. 2017. Augmenting end-to-end dialog systems with commonsense knowledge. arXiv preprint arXiv:1709.05453.