
Sarcasm Detection in Social Media

Ankit Signhaniya
1282 W 29th St., #4, Los Angeles, CA
[email protected]

Gaurav Shenoy
2656 Ellendale Pl., #2, Los Angeles, CA
[email protected]

Rohit Kondekar
1282 W 29th St., #4, Los Angeles, CA
[email protected]

Abstract

Sarcasm is a form of speech act in which speakers convey their message in the form of a sharply ironic taunt. Sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. We collected a corpus of 40 thousand sarcastic tweets (with the #sarcasm tag) and 170 thousand non-sarcastic tweets. This report presents a machine learning approach combined with natural language processing techniques, producing a model that improves on current research. Experiments show that Bag-of-Words (BOW) features tend to be more important than sentiment or topic probabilities, which suggests that vocabulary is more important than tone. A model with features consisting of Part-of-Speech (POS) n-grams, punctuation characteristics, sentiments, topic word probabilities, word trimming and word n-grams gives us the best F-score of 64% on a testing set consisting of 4 thousand #sarcastic tweets and 22 thousand non-sarcastic tweets. We also found that a simple neural net with a single projection layer matches the performance of general classifiers.

1 Introduction

Sarcasm is a speech act in which the speaker conveys meaning implicitly. This makes the sarcasm detection task interesting and challenging at the same time. It is sometimes hard even for humans to detect sarcasm, as it is subjective to context, people, language, culture, situation, etc. The same sentence may or may not be sarcastic depending on the situation. Most often, sarcasm depends on tone, which makes sarcasm detection in text all the more challenging. Because the intended meaning differs from the semantic understanding of the sentence, syntactic and semantic clues do not always help.

Automated detection of sarcasm is still in its early stages. One reason for the lack of computational models has been the difficulty of correctly identifying and labeling sarcastic tweets. Micro-blogging sites such as Twitter provide us with a much-needed training set via hashtags related to sarcasm, such as #sarcasm or #sarcastic.

Past research has mostly focused on either word features or contextual features. In addition, there has not been any substantial work on the performance of neural nets as a way to estimate sarcasm in text. In this report we evaluate various settings combining sentiments, topics and word features. We also make a systematic study of the possible use of neural nets over syntactic and semantic structures.

Detection of sarcasm is important as it will in turn help build better sentiment analyzers for review summarization, dialogue systems and review ranking systems, by disambiguating word sense. The current social and political scenario demands a better model to identify influential sections of the community employing the lowest form of wit in the Twittersphere. On a similar note, the US Secret Service needs to accurately determine sarcastic text and false positives to detect potential threats to national security. Table 1 shows examples of sarcastic tweets.

Couldn't ask for any better weather to play soccer in !#shittyweather #rainorshine.

What a fabulous day .i'm so happy people are being so nice today .

I picked the perfect day to ride to work.

Absolutely adore it when my bus is late!

Table 1: Examples of sarcastic tweets


2 Related Work

This task has been addressed in only a few studies, mostly in the context of spoken dialogue and primarily using speech-related cues such as tone, pitch and laughter (Tepperman et al., 2006). In text, automatic detection of sarcasm is considered a difficult problem (Nigam and Hurst, 2006; Pang and Lee, 2008). González-Ibáñez et al. (2011) use lexical features for the identification of sarcasm, an approach inspired by Kreuz and Caucci (2007).

González-Ibáñez et al. (2011) experimented with Twitter data divided into three categories (sarcastic, positive sentiment and negative sentiment), each containing 900 tweets. They used the #sarcasm and #sarcastic hashtags to identify sarcastic tweets. They used two classifiers: a support vector machine (SVM) with sequential minimal optimization (SMO) and logistic regression. They tried various combinations of unigrams, dictionary-based features and pragmatic factors (positive and negative emoticons and user references), achieving the best result (accuracy 0.65) for sarcastic vs. non-sarcastic classification with the combination of SVM with SMO and unigrams. They employed 3 human judges to annotate 180 tweets (90 sarcastic and 90 non-sarcastic). The human judges achieved Fleiss' κ = 0.586, demonstrating the difficulty of sarcasm classification. Another experiment included 50 sarcastic and 50 non-sarcastic (25 positive, 25 negative) tweets with emoticons, annotated by two judges. Automatic classification and the human judges achieved accuracies of 0.71 and 0.89 respectively. The inter-annotator agreement (Cohen's κ) was 0.74.

Reyes et al. (2012) proposed features to capture properties of figurative language such as ambiguity, polarity, unexpectedness and emotional scenarios. Their corpus consists of five categories (humor, irony, politics, technology and general), each containing 10,000 tweets. The best result in the classification of ironic versus general tweets was an F-measure of 0.65.

The work of Riloff et al. (2013) identifies one type of sarcasm: contrast between a positive sentiment and a negative situation. They used a bootstrapping algorithm to acquire lists of positive sentiment phrases and negative situation phrases from sarcastic tweets, and proposed a method which classifies a tweet as sarcastic if it contains a positive predicate that precedes a negative situation phrase in close proximity. Their evaluation on a human-annotated dataset of 3,000 tweets (23% sarcastic) was done using an SVM classifier with unigrams and bigrams as features, achieving an F-measure of 0.48. A hybrid approach that combines the results of the SVM classifier and their contrast method achieved an F-measure of 0.51.

The work most closely related to ours is that of Ptáček et al. (2014). They investigated supervised machine learning methods for language-independent sarcasm detection using features such as n-grams, patterns, POS tags, punctuation and emoticons. They used a large human-annotated Czech Twitter dataset containing 7,000 tweets with inter-annotator agreement κ = 0.54. Using an SVM classifier they significantly improved on a bounded dataset consisting of balanced and imbalanced tweets, with F-scores of 94 and 92 respectively. Liebrecht et al. (2013) use intensifiers and exclamations to differentiate between sarcastic and non-sarcastic tweets. Using a set of 3.3 million Dutch tweets, their classifier could correctly classify 101 tweets from a set of 135 sarcastic tweets. They concluded that it is fairly hard to differentiate sarcastic tweets from non-sarcastic ones in an open setting.

Another central class of models comprises those based on neural networks for sentiment analysis. These range from basic neural bag-of-words or bag-of-n-grams models to more structured recursive and time-delay neural networks (Socher et al., 2011; Socher et al., 2013). These either use semantic relations in pre-trained word embeddings or simultaneously learn word embeddings by minimizing the reconstruction cost of compositions together with the objective function's prediction error. Though there has not been any significant work on using neural networks for this task in the past, such evidence of state-of-the-art results from applications of neural networks makes them a promising approach for sarcasm detection.

Our method provides a balanced approach using intensifiers, word features, sentiment and topic models, emoticons and contextual features. We improve the standard F-score from 60 to 64.

3 Data

3.1 Training and Test Set (Rohit)

On the micro-blogging site Twitter, people write short messages of up to 140 characters. In addition to plain text, a tweet may contain links, user references (@<user>) and hashtags, which are used to identify the important topics of the tweet (#PresidentObama, #GOT, #worldcup) or express the sentiment of the user (#ecstatic, #sad, #sarcasm).

To build our corpus we gathered around 40 thousand sarcastic tweets and around 170 thousand non-sarcastic tweets. A large number of the sarcastic tweets came from existing research in this area (Ptáček et al., 2014), and we manually collected and labeled around 4,000 sarcastic tweets ourselves. We used the Twitter API to collect tweets that had the hashtag #sarcasm or #sarcastic from 10th to 20th April. Tweets without sarcasm hashtags were considered non-sarcastic data.
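As a minimal sketch, the labeling rule can be expressed as follows. Stripping the seed hashtag from the text (so a classifier cannot simply memorize the label) is our assumption here, and the function name is illustrative:

```python
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}

def label_tweet(text):
    """Label a tweet sarcastic if it carries a sarcasm hashtag, and strip
    that hashtag so the label cannot leak into the features."""
    tokens = text.split()
    is_sarcastic = any(t.lower() in SARCASM_TAGS for t in tokens)
    cleaned = " ".join(t for t in tokens if t.lower() not in SARCASM_TAGS)
    return cleaned, int(is_sarcastic)
```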

Before generating features from the collected data we performed the following pre-processing. First, we filtered out non-English tweets and retweets, and discarded tweets containing links and URLs. Next, we used TweetNLP to tokenize each tweet into relevant words, as it properly handles emoticons and other special sequences. We removed non-Latin characters and some special characters such as '^%$&*-+'. We lowercased all words and trimmed words containing more than 3 consecutive occurrences of a character down to 3 (e.g. whaaaaat? to whaaat?), thereby preserving the expressive significance of these words while distinguishing them from normal usage. We also tried generating n-grams after stemming each word with the Snowball stemmer.
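A minimal sketch of the cleanup and character-trimming steps described above (tokenization itself is delegated to TweetNLP; helper names are illustrative):

```python
import re

def trim_elongation(word, max_run=3):
    """Collapse runs of more than 3 identical characters to exactly 3,
    e.g. 'whaaaaat' -> 'whaaat', bounding vocabulary growth while keeping
    the expressive elongation visible."""
    return re.sub(r"(.)\1{3,}", r"\1" * max_run, word)

def clean_tokens(tokens):
    """Lowercase, strip the listed special characters, trim elongations."""
    out = []
    for tok in tokens:
        tok = re.sub(r"[\^%$&*\-+]", "", tok.lower())
        tok = trim_elongation(tok)
        if tok:
            out.append(tok)
    return out
```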

Our final corpus consisted of a 25-75 training split (25% sarcastic: 36,725 tweets; 75% non-sarcastic: 110,123 tweets) and a testing set of 4,483 sarcastic and 67,792 non-sarcastic tweets.

Dataset    Sarcastic   Non-Sarcastic
Training   36,725      110,123
Testing    4,483       67,792

Table 2: The tweet distribution in the dataset

3.2 Word Embedding Corpus (Ankit)

Tweets generally contain numerous informally written words and spellings, which add enormous noise to the implicit vocabulary actually being used. This creates a need for a large amount of data to learn word representations for each token with confidence. For this reason, we used an existing corpus of 467 million Twitter posts (Yang et al., 2011) from 20 million users, covering a 7-month period from June 1, 2009 to December 31, 2009, for training our word embeddings. These tweets were filtered to ignore punctuation, split compound words and transformed to lower case before being used to train word representations.

4 Technical Approach

We tested two methods to classify tweets as sarcastic or non-sarcastic. In the first method, we focus on supervised machine learning approaches and evaluate their performance. We selected various n-grams, including unigrams, bigrams and trigrams with frequency greater than three (Liebrecht et al., 2013), and a set of language-independent features, including punctuation marks, emoticons, quotes, character n-grams and skip-grams. In the second method, we exploit semantic and syntactic word relations with neural networks, using distributed word representations, various vector composition techniques, and simple neural networks with continuous non-linear activation functions.

4.1 Classification (Rohit & Gaurav)

Our evaluation was primarily performed using a Support Vector Machine with a linear kernel and a Maximum Entropy classifier. We used the scikit-learn SVM and the MegaM package for maximum entropy. Experiments were conducted with different hyper-parameters on 1:15 (full, sarcastic:non-sarcastic) and 1:5 (small) testing sets. We use a huge sparse binary matrix over the given features as the design matrix for the classifiers.
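A minimal sketch of this setup with scikit-learn, where LinearSVC stands in for the linear-kernel SVM and train_texts/train_labels are assumed to be prepared as described in Section 3:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Binary word n-grams (unigrams to trigrams, minimum frequency 3) yield
# the sparse binary design matrix described above.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=3, binary=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LinearSVC(C=0.01)  # C matches the setting reported in the result tables
clf.fit(X_train, train_labels)
print(classification_report(test_labels, clf.predict(X_test)))
```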

4.1.1 Feature Engineering

4.1.1.1 Word Features and Sentiments (Rohit)

Feature             Description

N-grams             We used unigrams, bigrams and trigrams as binary features. The feature space is pruned by a minimum n-gram occurrence set between 3 and 50.

Char-n-grams        We used character n-gram features (Blamey et al., 2012), from 3-grams to 5-grams, with the minimum occurrence set between 5 and 50.

Skip-bigrams        We used skip-bigrams (Guthrie et al., 2006) with a gap size of 2, retaining only skip-grams with frequency greater than 20 (see the sketch after Table 3).

Punctuation         We adapted the approach of Davidov et al. (2010). The feature set consists of the number of words, exclamation marks, question marks and quotation marks, normalized accordingly.

Topic probabilities We used Dato GraphLab to generate topic-word probabilities using LDA with Gibbs sampling, and use the probability distribution over words as a feature.

Partial sentiments  We divided each tweet into three sections and calculated sentiment (polarity and subjectivity) for each part individually, using the TextBlob library.

Stemmed n-grams     Each word is stemmed using the conservative Snowball stemmer and then used in n-gram features.

Table 3: Description of word feature set
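A minimal sketch of the character n-gram and skip-bigram features from Table 3; reading the gap as "up to 2 skipped tokens" is our interpretation of Guthrie et al. (2006):

```python
def char_ngrams(tokens, n_min=3, n_max=5):
    """Character 3- to 5-grams over the whitespace-joined tweet."""
    text = " ".join(tokens)
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def skip_bigrams(tokens, gap=2):
    """Bigrams allowing up to `gap` skipped tokens between the two words."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + gap + 2, len(tokens)))]
```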

4.1.1.2 Contextual Features (Gaurav)

Feature             Description

POS n-grams         We experimented with POS n-grams ranging from 3-grams to 5-grams, with the minimum occurrence count set to 5.

Full sentiment      Encapsulates the subjectivity and polarity of the entire tweet, using the TextBlob library.

Sentiment trigrams  The sentiment of each trigram of the tweet, used as a feature.

Partial sentiments  We divided each tweet into three sections and calculated sentiment (polarity and subjectivity) for each part individually (see the sketch after Table 4).

Initial and terminal sentiment  The sentiment of the first 5 words and the last 5 words, used as features.

Parse-tree parent bigrams  Bigrams of each word and its parent in the parse tree, used as features.

Table 4: Description of contextual feature set
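A minimal sketch of the partial-sentiment feature shared by Tables 3 and 4, using TextBlob's sentiment API; the equal-thirds split is our assumption about how the tweet is sectioned:

```python
from textblob import TextBlob

def partial_sentiments(tokens, parts=3):
    """Polarity and subjectivity for each third of the tweet."""
    feats = []
    size = max(1, len(tokens) // parts)
    for p in range(parts):
        start = p * size
        end = (p + 1) * size if p < parts - 1 else len(tokens)
        sent = TextBlob(" ".join(tokens[start:end])).sentiment
        feats.extend([sent.polarity, sent.subjectivity])
    return feats  # [pol_1, subj_1, pol_2, subj_2, pol_3, subj_3]
```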

4.2 Language Modeling (Gaurav)

We build language models using n-grams of the tweets. We observed that classifying tweets with this model alone results in very high recall but low precision for sarcastic tweets. We therefore tried a combination of this model and the model generated by classifying tweets with Maximum Entropy in binomial mode, to reduce the number of wrongly classified sarcastic tweets and improve the precision of the combined model while retaining a reasonable level of recall.
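A minimal sketch of one way to realize this combination; the add-one-smoothed unigram model and the agreement rule below are illustrative simplifications of the n-gram language models and thresholded MaxEnt probabilities described above, not the exact algorithm:

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one-smoothed unigram language model."""
    def __init__(self, docs):
        self.counts = Counter(w for d in docs for w in d)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1

    def logprob(self, tokens):
        return sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                   for w in tokens)

def combined_label(tokens, lm_sarc, lm_nonsarc, maxent_prob, threshold=0.55):
    """Flag a tweet sarcastic only when the (high-recall) LM comparison and
    the MaxEnt binomial probability agree, trading recall for precision."""
    lm_vote = lm_sarc.logprob(tokens) > lm_nonsarc.logprob(tokens)
    return int(lm_vote and maxent_prob >= threshold)
```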

4.3 Neural Networks (Ankit)

Word-based natural language understanding is a robust choice for NLP tasks like this one, but such an approach is fragile on noisy text data, as data sparsity becomes a significant challenge for word-based NLU due to the sheer number of possible tokens and usages.

One promising approach in natural language understanding that has recently emerged (Mikolov et al., 2013) is unsupervised learning of distributed word representations to capture semantic relations in text. We explore training a simple neural network model using these representations for sarcasm detection.

In another effort, we also learnt sarcasm-specific word embeddings using a Recursive Autoencoder (Socher et al., 2011) and evaluated sarcasm detection performance with an identical setup.

4.3.1 Word Embedding

We use the popular open word2vec library (https://code.google.com/p/word2vec/) to estimate distributed word vector representations for all words in the tweets. For this, we use the filtered corpus of 467 million tweets, training word2vec in the skip-gram setting with window sizes 3 and 4 and a word representation size of 100. These word embeddings were then augmented with an additional set of tokens and word representations for each of the known emoticons we preserved in our training and test sets. The representations for emoticons were mapped to manually tagged emoticon word tokens (e.g. amazed, angry, happy, joy, laugh). We used these word representations in word compositions to represent tweets.
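A minimal sketch of this training setup using the gensim re-implementation of word2vec rather than the original C tool; the min_count pruning threshold is our assumption:

```python
from gensim.models import Word2Vec

# corpus_tokens: one token list per tweet from the filtered 467M-tweet
# corpus, already lowercased and cleaned as described above.
model = Word2Vec(
    corpus_tokens,
    vector_size=100,  # word representation size used here
    window=3,         # we also report runs with window=4
    sg=1,             # skip-gram setting
    min_count=5,      # pruning threshold (assumed)
    workers=4,
)
vec = model.wv["happy"]  # 100-dimensional embedding for a token
```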

4.3.2 Tweets - Vector Compositions

A bigger challenge in employing word representations for neural networks is estimating distributed representations for sentences (tweets in our case). A tensor product of word vectors would be the most expressive choice, but it leads to two problems: first, the dimensionality grows with the number of words in the sentence, and second, the representations vary for different lengths of input sequences. The simplest approach we took was point-wise addition of the word vectors for all words w_1 w_2 … w_n in the tweet:

    f(w_1 w_2 \ldots w_n)[i] = \sum_{k=1}^{n} f(w_k)[i]        (1)

One shortcoming of this approach is that it fails to capture word-order information. For this reason, we also used the sum of circular convolutions over each pair of consecutive words in the tweet as its representation:

    g(w_k, w_{k+1}) = f(w_k) \ast f(w_{k+1})

where \ast denotes circular convolution, and

    f(w_1 w_2 \ldots w_n)[i] = \sum_{k=1}^{n-1} g(w_k, w_{k+1})[i]

We also considered a variation of the additive representation (Equation 1) that introduces the lexical information represented in the parse tree: the representation composed at each internal node is weighted by the number of words (leaf nodes) beneath it, and the vector for the root node is built up in the same fashion. For this, we use a dependency parser, TweeboParser (Lingpeng et al., 2014), and construct the representation as:

    f(node_j)[i] = \sum_{k \in children(j)} \alpha_k \cdot f(k)[i]

    \alpha_j = \sum_{k \in children(j)} \alpha_k, \qquad \alpha_k = 1 \text{ for leaf nodes}

Another common approach to composition is the concatenation or addition of two or more of the individual approaches. We consider a combination of the lexical, convolutional and additive compositions, weighted equally, as another tweet vector composition.

4.3.3 Neural-Net structure

We evaluated the performance of the word embeddings and semantic relations using a simple 3-layer feed-forward neural network with a continuous, non-linear, symmetric sigmoid activation function.

The three-layer network consists of an input layer of size 100, a hidden (projection) layer of size 100 (and 50) and an output layer of size 1. We hypothesized that the network would project features onto the hidden layer so as to score the sarcasm within tweets on a scale of -1 to 1 (symmetric sigmoid), 1 being most sarcastic. We trained the network using back-propagation through structure to minimize the prediction error, keeping the word representations constant. The trained model was then used to predict the output labels for tweets encoded in the same fashion as the training set.

For this study, we used an open, simple and fast neural network library (FANN, http://leenissen.dk/fann/wp/).
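A minimal numpy sketch of the 100-100-1 network, with tanh standing in for FANN's symmetric sigmoid; biases and the epoch loop are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (100, 100))  # input -> projection layer
W2 = rng.normal(0.0, 0.1, (100, 1))    # projection -> output score

def forward(x):
    """Score a composed tweet vector on the [-1, 1] sarcasm scale."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def sgd_step(x, y, lr=0.01):
    """One back-propagation step on squared prediction error, keeping
    the word representations themselves fixed."""
    global W1, W2
    h = np.tanh(x @ W1)
    out = np.tanh(h @ W2)
    d_out = (out - y) * (1 - out ** 2)   # gradient at the output layer
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # gradient at the hidden layer
    W2 -= lr * np.outer(h, d_out)
    W1 -= lr * np.outer(x, d_h)
```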

4.3.4 Recursive Auto-encoder

Word representations are learnt recursively over the lexical parse tree (unsupervised or semi-supervised) using the recursive auto-encoder (Socher et al., 2011), minimizing the reconstruction cost of the composed word vectors. This is, in our view, a novel approach to vector composition, as it minimizes the reconstruction error while training the encoder for later use. The auto-encoder is then extended (Socher et al., 2011) to also predict the label at all internal and root nodes of the dependency tree and to minimize a combined weighted error function. We use this network, in its default settings (http://www.socher.org/index.php/Main/Semi-SupervisedRecursiveAutoencodersForPredictingSentimentDistributions), as another approach to classifying sarcastic tweets.
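For reference, the node-level objective from Socher et al. (2011) that this setup minimizes: each parent vector p is composed from its children c_1, c_2 and scored by how well it reconstructs them,

    p = \tanh\left(W^{(e)}[c_1; c_2] + b^{(e)}\right), \qquad [c_1'; c_2'] = W^{(d)} p + b^{(d)}

    E_{rec} = \left\lVert [c_1; c_2] - [c_1'; c_2'] \right\rVert^2

The semi-supervised variant adds a cross-entropy term on the label predicted from p, weighted against E_{rec} by a hyper-parameter α.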

5 Evaluation and Analysis

5.1 Base Model (Rohit)

To evaluate our model we used the F-score, the harmonic mean of precision and recall: F = 2PR / (P + R). To form a baseline we built a Naïve Bayes model using TF-IDF features on the full test set (1:15) (Table 5).

            Precision   Recall   F-score
Sarcasm     1.00        0.80     0.89
Non-Sarc    0.24        0.95     0.39

Table 5: Baseline Naive Bayes TF-IDF

Dataset        Sarcastic   Non-Sarcastic
1:5 Testing    4,483       22,415
1:15 Testing   4,483       67,792

Table 6: Two Testing sets

5.2 Classification Results

5.2.1 Support Vector Machine (Rohit)

POS + NGram | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.48        0.57     0.52
Non-Sarc    0.97        0.96     0.96



Table 7: Feature Set 1 | SVM

Char-NGrams + NGram | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.50        0.51     0.50
Non-Sarc    0.97        0.97     0.97

Table 8: Feature Set 2 | SVM

SkipGrams + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.48        0.51     0.49
Non-Sarc    0.97        0.96     0.97

Table 9: Feature Set 3 | SVM

Puncs + NGrams + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.48        0.57     0.52
Non-Sarc    0.97        0.96     0.96

Table 10: Feature Set 4 | SVM

Stemmed NGrams + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.44        0.41     0.42
Non-Sarc    0.97        0.96     0.96

Table 11: Feature Set 5 | SVM

Sentiments + Puncs + Ngrams + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.50        0.55     0.52
Non-Sarc    0.97        0.96     0.97

Table 12: Feature Set 6 | SVM

Feature Reduction + Sentiment + Punc + Ngram + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.45        0.53     0.49
Non-Sarc    0.97        0.96     0.96

Table 13: Feature Set 7 | SVM

Topics + Sentiment + Punc + Ngram + POS | 1:15 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.46        0.56     0.50
Non-Sarc    0.97        0.96     0.96

Table 14: Feature Set 8 | SVM

Topics + Sentiment + Punc + Ngram + POS | 1:5 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.92        0.95     0.93
Non-Sarc    0.71        0.56     0.63

Table 15: Feature Set 9 | Smaller Set | SVM

Trimmed + Topics + Sentiment + Punc + Ngram + POS | 1:5 Testing set | C = 0.01

            Precision   Recall   F-score
Sarcastic   0.92        0.96     0.94
Non-Sarc    0.73        0.57     0.64

Table 16: Feature Set 10 | Smaller Set | SVM

5.2.2 Maximum Entropy (Gaurav)

POS + NGram | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.38        0.62     0.47
Non-Sarc    0.97        0.93     0.95

Table 17: Feature Set 1 | MegaM

Char-NGrams + NGram | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.41        0.53     0.46
Non-Sarc    0.97        0.94     0.95

Table 18: Feature Set 2 | MegaM

SkipGrams + POS | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.41        0.51     0.45
Non-Sarc    0.97        0.96     0.97

Table 19: Feature Set 3 | MegaM

Puncs + NGrams + POS | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.38        0.62     0.47
Non-Sarc    0.97        0.93     0.95

Table 20: Feature Set 4 | MegaM

Ngrams + POS + Sentiment | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.38        0.65     0.48
Non-Sarc    0.98        0.93     0.95

Table 21: Feature Set 5 | MegaM


Ngrams + POS + Trigram Sentiment | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.38        0.60     0.46
Non-Sarc    0.97        0.93     0.95

Table 22: Feature Set 6 | MegaM

Ngrams + POS + Partial Sentiment | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.41        0.61     0.49
Non-Sarc    0.97        0.94     0.96

Table 23: Feature Set 7 | MegaM

Ngrams + POS + Initial Terminal Sentiment | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.41        0.61     0.49
Non-Sarc    0.97        0.94     0.96

Table 24: Feature Set 8 | MegaM

Ngrams + (word + ParseTree root) bigrams | 1:15 Testing set | Binary classification

            Precision   Recall   F-score
Sarcastic   0.18        0.13     0.15
Non-Sarc    0.97        0.94     0.95

Table 25: Feature Set 9 | MegaM

Ngrams + POS + Initial Terminal Sentiment | 1:15 Testing set | Binomial classification (threshold 0.55)

            Precision   Recall   F-score
Sarcastic   0.43        0.59     0.50
Non-Sarc    0.97        0.95     0.96

Table 26: Feature Set 10 | MegaM

Ngrams + POS + Initial Terminal Sentiment | 1:15 Testing set | Binomial classification (threshold 0.60)

            Precision   Recall   F-score
Sarcastic   0.46        0.56     0.50
Non-Sarc    0.97        0.96     0.96

Table 27: Feature Set 11 | MegaM

Our model, which includes trimmed words, topic probabilities, sentiments, punctuation, n-grams and POS n-grams, significantly improves over the baseline TF-IDF Naïve Bayes approach. On the small test set (1:5) we observe an F-score of 64, and on the large set (1:15) an F-score of 52, using the SVM with a linear kernel. This is a small improvement over previous studies in sarcasm detection on tweets, which obtained F-scores in the range of 50-55. Previous work on sarcasm detection uses only n-gram features and/or rule-based classification algorithms. Our analysis improves over them by incorporating a sentiment decomposition analysis and a topic modeling analysis. Moreover, we used a modern machine-learning algorithm, SVM, which is not rule-based.

5.3 Language Modeling (Gaurav)

Language modeling BOW unigrams | 1:15 Testing set

            Precision   Recall   F-score
Sarcastic   0.17        0.85     0.28
Non-Sarc    0.99        0.72     0.83

Table 28: Language modeling unigrams

Language modeling BOW unigrams + (Ngrams + POS + Initial Terminal Sentiment | Binomial classification (0.55)) | 1:15 Testing set

            Precision   Recall   F-score
Sarcastic   0.48        0.57     0.52
Non-Sarc    0.97        0.96     0.97

Table 29: Language modeling unigrams + MegaM Binomial 0.55

Language modeling BOW unigrams + (Ngrams + POS + Initial Terminal Sentiment | Binomial classification (0.60)) | 1:15 Testing set

            Precision   Recall   F-score
Sarcastic   0.51        0.54     0.52
Non-Sarc    0.97        0.96     0.97

Table 30: Language modeling unigrams + MegaM Binomial 0.60

As expected, the bag-of-words unigram language model had low precision and very high recall, resulting in an overall low F-score for sarcastic tweets. But the combination of the language model with the MegaM binomial model generated from Ngrams + POS + Initial Terminal Sentiment features achieved a higher F-score than either individual model computed separately.

5.4 Performance of Neural Networks (Ankit)

A simple neural network with composition by plain addition is able to achieve F-scores of 47% and 63% on sarcasm for the 15:1 and 5:1 test sets respectively, whereas the most complex approach, using a combination of lexical features with additive and convolutional components in the composition, performs fairly similarly.

Approach                    Sarcasm                   Non-Sarcasm
                            Prec.   Rec.   F-score    Prec.   Rec.   F-score
Additive                    0.40    0.59   0.47       0.97    0.94   0.96
Convolution                 0.37    0.46   0.41       0.96    0.95   0.95
Additive + Convolution      0.39    0.56   0.46       0.97    0.94   0.95
ParseTree Additive          0.41    0.59   0.49       0.98    0.95   0.96
ParseTree (Conv. + Add.)    0.46    0.52   0.49       0.97    0.94   0.96

Table 31: Results for 15:1 Test Set

Approach                    Sarcasm                   Non-Sarcasm
                            Prec.   Rec.   F-score    Prec.   Rec.   F-score
Additive                    0.67    0.59   0.63       0.94    0.94   0.94
Convolution                 0.63    0.52   0.57       0.90    0.91   0.91
Additive + Convolution      0.66    0.55   0.60       0.91    0.94   0.92
ParseTree Additive          0.67    0.58   0.62       0.93    0.95   0.94
ParseTree (Conv. + Add.)    0.67    0.59   0.63       0.93    0.94   0.93

Table 32: Results for 5:1 Test Set

We found that using only convolution gives the worst results, whereas simple composition by addition is able to hit the bar. We also find that using the lexical structure from the dependency tree gives an improvement in overall accuracy and F-score. This indicates that sarcasm detection, in tweets in particular, is partially independent of word order.

Using the Recursive Autoencoder (RAE) in its default settings does not yield a competitive result. In fact, the results were worse than any other results obtained with a simple NN. One reason could be the lack of training data for correctly estimating the parse tree in unsupervised mode; the inherent noise in tweets would additionally lead to poorly trained word representations and hence a poorly trained autoencoder. Nonetheless, it performed very well on the training data, achieving an average F-score of 68% and an accuracy of 80%. This indicates the potential of RAE for classifying sarcastic tweets given enough training data and strategically chosen model settings, which we leave to future work.

5.5 Results Discussion

Figure 1: Comparison of F-scores for sarcastic tweets

Figure 2: Comparison of F-scores for non-sarcastic tweets

Figures 1 and 2 show a comparison of the different methods we studied. We find that text filtering, feature engineering and hyper-parameter selection are crucial to the detection of sarcasm and to the overall performance of the classifiers. We also note how a simple neural network is able to match the results from text analysis, which opens the possibility of future work to outperform the existing results by employing recurrent deep neural networks trained using back-propagation through time.

6 Conclusion

Sarcasm detection in individual tweets is a difficult task. Our study has shown that there is a possibility of improving the state-of-the-art results by extensive feature engineering and the potential use of deep neural networks.



Another approach to detecting sarcasm would be to consider cues from surrounding tweets as context, which we believe will be key to sarcasm detection.

References

Ben Blamey, Tom Crick, and Giles Oatley. 2012. R U :-) or :-(? Character- vs. word-gram feature selection for sentiment classification of OSN corpora. In Proceedings of AI-2012, The Thirty-second SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pages 207-212. Springer.

Christine Liebrecht, Florian Kunneman, and Antal Van den Bosch. 2013. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA 2013.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 107-116, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pages 1-4.

J. Yang and J. Leskovec. 2011. Temporal variation in online media. In ACM International Conference on Web Search and Data Mining (WSDM '11).

R. J. Kreuz and G. M. Caucci. 2007. Lexical influences on the perception of sarcasm. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 1-4, Rochester, New York. Association for Computational Linguistics.

K. Nigam and M. Hurst. 2006. Towards a robust metric of polarity. In Computing Attitude and Affect in Text: Theory and Applications, pages 265-279.

B. Pang and L. Lee. 2008. Opinion Mining and Sentiment Analysis. Now Publishers Inc, July.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213-223, Dublin, Ireland, August 23-29 2014.

E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, and R. Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 704-714, Seattle, Washington, USA. Association for Computational Linguistics.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 581-586, Portland, Oregon, USA.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing.

J. Tepperman, D. Traum, and S. Narayanan. 2006. "Yeah right": Sarcasm recognition for spoken dialogue systems. In InterSpeech ICSLP, Pittsburgh, PA.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR 2013.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of EMNLP 2014.