
CSC 59866 Senior Design Project Specification

Professor Jie Wei

Wednesday, November 23, 2016

Sarcasm Detection in Text: Design Document

Jesse Feinman, James Kasakyan, Jeff Stolzenberg


Table of contents

Overview
Literature Review
    Contextualized Sarcasm Detection on Twitter
        Overview
        Dataset
        Features
            Tweet features
            Author features
            Audience features
            Response features
        Results
    Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon
        Overview
        Datasets
            Twitter Dataset
            Amazon Dataset
        Classification Algorithm
        Results
Design outline
    Dataset
    Classification
        N-Gram Frequency Classification
        Suffixes
        Term Frequency-Inverse Document Frequency
        Sentiment Analysis
            N-Gram Sentiment
        Capitalization, Punctuation, Hashtags, and Emoji
            Hashtags and Emoji
                Hashtags
                Emoji
        Long words and vowel-less words
        Pattern Collection and Matching
        Part of speech patterns
        GloVe
            N-Grams
            Patterns
            Part of Speech Patterns
        Context
    N-fold Cross-Validation
    Neural Network
References

Overview

This document outlines our design methodology for building a classification model to detect sarcasm in text. First, in the literature review section, we review recent attempts by professionals in the field to tackle the problem of modeling sarcasm, and examine their methodologies and the machine learning techniques used in their models. We review two papers, by Bamman and Smith [1] and by Davidov et al. [2]. We then describe the design methodology we intend to follow for our approach to the problem, drawing on the techniques in the referenced papers but also expanding upon them with our own ideas.


Literature Review

Contextualized Sarcasm Detection on Twitter

Overview

This paper [1], written in 2015 by David Bamman and Noah A. Smith, computer science researchers at Carnegie Mellon University, attempts to tackle the problem of building a classification model to detect sarcasm in tweets. Because of the unique structure of tweets, the authors were able to gather data that is both pre-labeled as sarcastic and that contains information about the context of the text.

Dataset

When considering the source for their dataset, Bamman and Smith noted that in previous attempts to design systems to classify sarcasm, the datasets were labeled by human judges who were prone to error, claiming they "found low agreement rates between human annotators at the task of judging the sarcasm of others' tweets" [1]. They also noted that previous attempts to model sarcasm treated it as a text categorization problem, while they felt that "sarcasm requires shared knowledge between speaker and audience; it is a profoundly contextual phenomenon" [1]. For this reason, Bamman and Smith wanted to capture contextual features for their model.

To achieve these goals, they crawled the last 3,200 tweets of tweet authors over a nine-month period spanning 2013-2014. From this set, they took 9,767 tweets that were replies to another tweet (providing context), that contained at least three words, and that had #sarcastic or #sarcasm as their final term. For the negative sample, they examined tweets from the same time period that were not self-labeled with #sarcastic or #sarcasm. This yielded a balanced training set of 9,767 self-labeled sarcastic tweets and 9,767 non-self-labeled tweets.

Features

Features were divided into four classes according to the type of information they captured: tweet features, author features, audience features, and response features.

Tweet features

Tweet features are those derived entirely from the text of the tweet to be predicted. These include binary indicators of unigrams and bigrams, as well as binary indicators of unigrams and bigrams in a reduced space of 1,000 Brown clusters. Part-of-speech features, like the ratio of nouns to verbs and the density of hashtags or emoticons, are also included as tweet features, as are capitalization features and both tweet-level and word-level sentiment features.

Author features

Author features are derived from information about the user who wrote the tweet to be predicted. These include binary indicators of the top 100 terms in the author's corpus scored by TF-IDF. Bamman and Smith note that this is the single most informative feature: a binary logistic regression classifier scores an accuracy of 81.2% when trained only on this feature. Other author features include profile information, like gender and number of followers, as well as historical author sentiment features.

Audience features

Audience features attempt to capture information about the shared context between the author of the tweet to be predicted and the author of the tweet being replied to. These include all of the author features listed above, but computed for the author of the original tweet that was replied to. Features that capture the historical communication between the two users, like the number of previous messages sent, are also included as audience features.

Response features

Response features are derived from information about the contents of the original and reply tweets. These include binary indicators of pairwise Brown features between the two tweets, as well as binary indicators of unigrams in the original tweet.

Results

Bamman and Smith trained binary logistic regression models on all possible combinations of feature classes. Using only tweet-level features, their model achieved an average accuracy of 75.4% across 10-fold cross-validation. Adding response features increased the accuracy of the model by just under 2%, to 77.3%, and combining tweet-level features with audience features increased accuracy by 3.6%, to 79.0%. Combining tweet features and author features provided the largest jump in accuracy, going from 75.4% using only tweet features to 84.9%. This is just 0.2% lower than the accuracy of a model trained on all features, which scored 85.1%.

From these results, Bamman and Smith conclude that capturing context is vital for models that attempt to predict sarcasm, since the features designed to capture context provide significant improvements in accuracy over tweet-level features alone.


Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon

Overview

This paper [2], written in 2010 by Dmitry Davidov, Oren Tsur, and Ari Rappoport of The Hebrew University, focuses on a semi-supervised approach to sarcasm identification. The experiment was performed on two very different datasets: a set of tweets from Twitter and a collection of Amazon reviews. Utilizing sentences that were ranked and pre-labeled by level of sarcasm, the team constructed feature vectors that were in turn used to build a classifier model that assigned scores to unlabeled examples.

Datasets

Twitter Dataset

The first dataset the team utilized came from Twitter, a very popular microblogging service that allows users to publish and read short messages called tweets. Tweets are restricted to 140 characters and may contain references to URL addresses, references to other Twitter users (these appear as @), and content tags (called hashtags) assigned by the tweeter (#).

Due to Twitter's informal nature and its constraint on character length, the team found that users are often forced to use a large amount of slang, shortened lingo, ASCII emoticons, and other tokens absent from formal lexicons. The three experimenters stated that "These characteristics make Twitter a fascinating domain for NLP applications, although posing great challenges due to the length constraint, the complete freedom of style and the out of discourse nature of tweets" [2].

The Twitter dataset used comprised 5.9 million unique tweets, averaging 14.2 words per tweet. Additionally, 18.7% of the tweets contained a URL, 35.5% contained a reference to another Twitter user, and 6.9% contained at least one hashtag.

Amazon Dataset

The second dataset used in this experiment was a collection of reviews from Amazon.com, containing 66,000 reviews of 120 different products.

The researchers selected this dataset because of its stark contrast to the Twitter dataset. The Amazon reviews averaged 953 characters, much longer than the tweets; they were also more structured and grammatical than tweets, and are delivered in a known context.

Classification Algorithm

The algorithm used by this team of researchers was semi-supervised. The input was a small seed of labeled sentences that had been annotated by three humans. The annotated sentences were ranked on a scale from 1 to 5, in which a score of 5 indicated a clearly sarcastic sentence and a score of 1 indicated a clear absence of sarcasm.

Once the team had the labeled sentences, they extracted a set of features to be used in feature vectors. The main feature types utilized were syntactic and pattern-based features. Feature vectors for each of the labeled examples in the training set were constructed and used to build a classifier model that assigned scores to the unlabeled examples.

Data Preprocessing


The first aspect of the algorithm's framework was preprocessing of the data. To facilitate pattern matching, the team replaced specific information with metadata tags. Each appearance of a product, company, book title, author, user, URL, or hashtag was replaced with the corresponding generalized tag: '[PRODUCT]', '[COMPANY]', '[TITLE]', '[AUTHOR]', '[USER]', '[LINK]', or '[HASHTAG]'.

Pattern Extraction

The main feature type for the algorithm was based on surface patterns. The team classified words into two types: high-frequency words (HFWs), with a corpus frequency greater than 1,000 occurrences per million words, and content words (CWs), with a frequency of less than 100 occurrences per million words. A pattern was then defined as "an ordered sequence of 2-6 HFW's and 1-6 CW's" [2].
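As a concrete sketch (the function name is ours; the thresholds are the ones quoted above), the HFW/CW split can be computed from raw corpus counts:

```python
from collections import Counter

def classify_word_types(corpus_tokens):
    """Tag each vocabulary word as an HFW (> 1,000 occurrences per
    million words), a CW (< 100 per million), or neither, following
    the thresholds quoted from [2]."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    tags = {}
    for word, n in counts.items():
        per_million = n / total * 1_000_000
        if per_million > 1000:
            tags[word] = "HFW"
        elif per_million < 100:
            tags[word] = "CW"
        else:
            tags[word] = None  # neither class; not used in patterns
    return tags
```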

Pattern Matching

Once patterns were identified, a single entry was constructed in the feature vector for each sentence. A feature value was then calculated for each pattern: an exact match to a pattern from a sentence labeled sarcastic in the training set scored a 1, sparse and incomplete matches scored progressively lower, and sentences with no pattern matches scored a 0.

Additional Features

In addition to pattern-based features, some generic features were used as well. These included the sentence length in words, the number of exclamation point "!" characters in the sentence, the number of question mark "?" characters in the sentence, the number of quotes in the sentence, and the number of capitalized words in the sentence.

Classification

Lastly, the team needed to assign scores to the new examples in the test set. To do this, they used a k-nearest neighbors (kNN)-like strategy. Feature vectors were constructed for each example in the training and test sets. For each feature vector v in the test set, they computed the Euclidean distance to each of the matching vectors in the extended training set, where matching vectors share at least one pattern feature with v. The score was then a weighted average of the labels of the k closest training set vectors.
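A minimal sketch of this kNN-like scoring; the inverse-distance weighting below is our assumption, not necessarily the exact weighting used in [2]:

```python
import math

def knn_like_score(test_vec, training_set, k=5):
    """Score a test feature vector as a weighted average of the labels
    of its k nearest matching training vectors. `training_set` is a
    list of (feature_vector, label) pairs; a vector "matches" if it
    shares at least one nonzero feature with the test vector."""
    matches = [(vec, label) for vec, label in training_set
               if any(a and b for a, b in zip(vec, test_vec))]
    if not matches:
        return 0.0
    nearest = sorted(
        (math.dist(vec, test_vec), label) for vec, label in matches
    )[:k]
    # Inverse-distance weighting: closer neighbors count for more.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * label for w, (_, label) in zip(weights, nearest)) / sum(weights)
```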

Results

The experiment yielded promising results. The researchers found that, on average, the semi-supervised algorithm achieved a precision of 77% and a recall of 83.1%. They were surprised to find that punctuation marks served as the weakest indicators of sarcasm, while excessive exclamation marks and capital letters were moderately useful indicators. The use of three consecutive dots, when combined with other features, constituted a strong predictor.

Design outline

Our methodology for detecting sarcasm is based upon the research described above. Going step by step, from simpler to more advanced classification methods, we will evaluate the efficacy of each using cross-validation. The end result will be a system that uses a variety of classification tools, from which we have eliminated the classifiers that produced no benefit or that hurt the results.

Dataset

The data we gather for the project will likely be live streaming tweets, collected over a period of a few days or weeks until we have a sufficiently large amount to train our system. However, Twitter data is problematic: while the data is readily available, there is little to no context due to the short messages. To gain context with Twitter data, we would need to look at replies, past tweets, and the user's profile. While this may be possible, it is a more advanced option that we hope to reach by the end of the project; gathering and using the context for each tweet is more of a stretch goal. Ideally, we will be able to obtain a dataset that has more context in the surrounding text and does not require specific background knowledge of the actors.

To narrow the scope, using Twitter data initially, we will select #sarcasm, #sarcastic, and other hashtags that allow us to tailor our system to a specific niche area. This focus will make our system less generalizable but, in theory, more accurate on that particular dataset. If we are able to attain sufficiently high accuracy with a niche focus, then testing the system on a more general data selection would be the next goal. Our data will be tagged as sarcasm or not sarcasm, so we will primarily use supervised learning techniques.

Tweets will require preprocessing to remove the labeling hashtags and to replace proper nouns with generics, so that we are analyzing the sarcasm of the language rather than of the subject of the message. We can also process the tweets with the proper nouns included, in case subject matter expertise yields better results.


Classification

We will implement multiple systems to attempt to classify sarcasm. Each system will return a normalized value between 0 and 1, which can then be processed by a neural network. Using Python's SciPy and NLTK libraries, we will create classifiers that indicate, on a scale of 0 to 1 (with 0 being entirely non-sarcastic and 1 being completely sarcastic), the confidence of each individual classifier's result.

N-Gram Frequency Classification

We plan to explore using n-gram frequencies with different sizes of n. By looking at words and phrases that are common in sarcastic remarks, we hope to train the system to recognize and classify sarcastic remarks.

We will create two frequency tables, one for sarcasm and one for non-sarcasm, compare the n-grams of a given message against each table, and return the percent match to each.

We will then repeat the n-gram frequency analysis with a lemmatized version of the message to see if there is a difference in results.
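The two-table scheme above can be sketched as follows; the class and method names are our own placeholders, not a settled design:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NgramFrequencyClassifier:
    """Two frequency tables, one built from sarcastic training text and
    one from non-sarcastic text; a message is scored by the fraction of
    its n-grams found in each table."""

    def __init__(self, n=2):
        self.n = n
        self.sarcastic = Counter()
        self.literal = Counter()

    def train(self, tokens, is_sarcastic):
        table = self.sarcastic if is_sarcastic else self.literal
        table.update(ngrams(tokens, self.n))

    def score(self, tokens):
        grams = ngrams(tokens, self.n)
        if not grams:
            return 0.0, 0.0
        sarc = sum(1 for g in grams if g in self.sarcastic) / len(grams)
        lit = sum(1 for g in grams if g in self.literal) / len(grams)
        return sarc, lit
```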

Suffixes

Using the same technique as the n-gram frequency analysis, we will also build a frequency analysis over the suffixes of words in the tweets, obtained by lemmatizing each word and subtracting the lemma from the original word. For the same reasons as in the n-gram frequency analysis, we hope to discover patterns in suffix frequency that can help classify sarcastic remarks.
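The "subtraction" step could look like the sketch below; a real lemmatizer (e.g. NLTK's WordNetLemmatizer) would supply the lemma, and the common-prefix fallback is our rough stand-in for irregular forms:

```python
def suffix(word, lemma):
    """Return the suffix left after removing the lemma from the word,
    e.g. ('running', 'run') -> 'ning'. Falls back to stripping the
    longest common prefix when the word does not start with the lemma."""
    if word.startswith(lemma):
        return word[len(lemma):]
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:]
```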

Term Frequency-Inverse Document Frequency

With the same n-grams as above, we will also look at the frequency of each n-gram in relation to the inverse frequency of that term across the corpus of all messages under consideration. This may tell us the importance of certain terms, and we may be able to identify trends of some words indicating sarcasm or non-sarcasm. This is mainly useful on the tweets that have proper nouns left in, to see if certain people, places, or things indicate sarcasm.
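The standard TF-IDF weighting we have in mind is tf(t, d) × log(N / df(t)); a small self-contained version (libraries like scikit-learn offer smoothed variants) is:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Per-document TF-IDF weights for a list of tokenized documents:
    tf(t, d) * log(N / df(t)), where df counts documents containing t."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term appearing in every message (like a common stop word) gets weight zero, while a term unique to a few messages is weighted up.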

Sentiment Analysis

We will examine the sentiment of each message we analyze. If we can determine a trend in sentiment that helps us classify sarcastic remarks, we will include full-message sentiment analysis. We will use Minqing Hu and Bing Liu's sentiment word list, as provided in NLTK, to train the sentiment system.
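A word-list scorer in the spirit of the Hu and Liu lexicon might look like this; the tiny word sets here are illustrative placeholders, not the real lexicon (which NLTK exposes as the `opinion_lexicon` corpus):

```python
# Placeholder stand-ins for the Hu & Liu positive/negative word lists.
POSITIVE = {"great", "love", "wonderful", "yay", "awesome"}
NEGATIVE = {"terrible", "hate", "awful", "worst", "broken"}

def sentiment(tokens):
    """Score a tokenized message from -1 (negative) to +1 (positive)
    by counting lexicon hits."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)
```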

N-Gram Sentiment

Whole-message sentiment may not be very revealing, but by looking at partial sentiment, i.e. the sentiment of n-grams of various sizes, we hope to identify a trend in sarcastic remarks that can help us better classify sarcasm.

Capitalization, Punctuation, Hashtags, and Emoji

Most of the time in NLP, we would normalize everything to the same case and not pay much attention to punctuation. In our case, we will attempt to find patterns in capitalization and punctuation that can help us determine sarcasm classifications. For instance, text that appears in quotes may be treated differently than the text surrounding the quoted message, and multiple exclamation and question marks may indicate a different meaning than a message without them. This analysis will be difficult, as it can be done both with the context of the words surrounded by the relevant punctuation and without those words, looking only at the punctuation patterns.
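The surface features we have in mind could be extracted as below; the particular feature set is our sketch and will likely grow as we experiment:

```python
import re

def punctuation_features(message):
    """Extract simple capitalization/punctuation counts from a raw
    message string."""
    words = message.split()
    return {
        "exclamations": message.count("!"),
        "questions": message.count("?"),
        "quotes": message.count('"'),
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        "repeated_punct": len(re.findall(r"[!?]{2,}", message)),
    }
```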

Hashtags and Emoji

With social media becoming more prevalent, many users use hashtags and emoji to provide implicit information that helps the reader understand the true intent of their message.

Hashtags

We will remove the obvious hashtags, such as #sarcasm and #serious, and anything else that could be used to definitively identify a message as sarcasm, but we will try to analyze the remaining hashtags to see if there are trends in hashtags that point to the sarcastic or serious nature of a remark.

Emoji

There are two types of emoji that we will look at. The first is strings of non-word characters such as :) :'( >_< , etc., which are meant to represent faces and express an emotion toward the topic being discussed. These will not necessarily be easy to identify and may require us to compile a database of existing emoticons prior to analysis, then look for them in messages and use frequency analysis and patterns to try to identify trends.

The second type is the single-character emoji often used from the emoji selection on mobile keyboards. These come in as single characters, usually a Unicode code point that uniquely represents the symbol. Attempting to analyze the rendered emoji by itself would be difficult and futile, given that emoji render differently on different systems. However, we can take the character identifier and use frequency and pattern analysis to try to derive the meaning of the emoji as it relates to sarcastic remarks, given the context in which it occurs in our training dataset.

For instance, consider a positive-sentiment message followed by a particular emoji:

"Yay, Trump Won! [disappointed face]"

To us it is clear that the writer is not happy, because that face is usually associated with disappointment. With enough examples in a training set of emoji being used in context, we hope to be able to establish their meaning.

Additionally, we have the option of pulling a database of existing emoji and the words used to describe them, then assigning them sentiments based on the database. Using these sentiments, we could substitute each emoji with the given sentiment or with a synonym of the emotion it is intended to convey.

Long words and vowel-less words

Looking at words with a large number of syllables and at words without vowels has been suggested as a possible method of sarcasm detection, which we are going to explore. The frequency of vowel-less words in a message, and the frequency of words with large numbers of syllables, may help our classification.
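A sketch of these two counts; the "three or more vowel groups" syllable approximation is our rough heuristic, not a real syllable counter:

```python
import re

VOWELS = set("aeiou")

def lexical_features(tokens):
    """Count vowel-less alphabetic words (e.g. 'tbh', 'smh') and long
    words, approximated as words with three or more vowel groups."""
    vowelless = sum(1 for t in tokens
                    if t.isalpha() and not set(t.lower()) & VOWELS)
    long_words = sum(1 for t in tokens
                     if len(re.findall(r"[aeiouy]+", t.lower())) >= 3)
    return {"vowelless": vowelless, "long_words": long_words}
```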


Pattern Collection and Matching

We will attempt to find phrase patterns that occur in sarcastic remarks but not, or far less frequently, in non-sarcastic remarks. Patterns are n-grams in which some of the words are replaced by generic placeholders; for instance, "I went to the [generic]." would be a pattern. Then, using the common words from our previous analysis, we will check whether a pattern matches a message and, if it does, whether the generic term ranks high in our sarcasm frequency table.
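Turning a token sequence into such a pattern is a simple substitution; the function name and the choice of which words to genericize are our placeholders:

```python
def make_pattern(tokens, content_words):
    """Replace designated content words in a token sequence with a
    [generic] slot, yielding patterns like 'i went to the [generic]'."""
    return " ".join("[generic]" if t in content_words else t for t in tokens)
```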

Part of speech patterns

Previous research has suggested that parts of speech, and the frequency, density, and patterns of those parts, may be another useful tool. Using n-gram and pattern-matching analysis on the POS tags of a message, we hope to extract useful classification information.

GloVe

The research we reviewed used Brown clustering to establish context; we, however, will use Global Vectors for Word Representation (GloVe) to plot words in a multidimensional vector space based upon their context. Due to the more advanced nature of GloVe, we expect better results. When vectors are closer to one another in this space, they have more similar contexts, which for us may indicate that words occurring in certain contexts are more likely to be sarcastic or not.
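Pre-trained GloVe vectors are distributed as plain text files (one word per line, followed by its vector components), and "closeness" is usually measured with cosine similarity. A minimal sketch, assuming that file format:

```python
import math

def load_glove(path):
    """Parse a GloVe text file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two word vectors; values near 1 mean
    the words appear in similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```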

Page 17: Sarcasm Detection in Text: Design Documentcsjie/cap/f16_des_rpt/SarcasmDesignDocu… · sarcasm in text. First, in the literature review section, we review recent attempts by professionals

16

N-Grams

Using the n-grams we previously examined, we intend to compute each n-gram's position in the multidimensional space with GloVe, based on the contexts in which the n-grams occur. This allows us to compare the contexts of the n-grams we find against those of the trained n-grams, to see if the context is similar to that of sarcastic remarks.

Patterns

Using the same technique as for the n-grams above, we will locate the contexts in which the patterns occur in the multidimensional space. By examining the language patterns discussed previously, we can see whether certain patterns occur in concert with other patterns that may indicate sarcasm.

Part of Speech Patterns

In the same way that both the n-grams and the patterns are checked in the multidimensional space, we will also check the part-of-speech patterns against this space.

Context

This is not as practical for tweets, given the inherent lack of contextual information they provide; however, where possible, we would like to analyze the sentences surrounding a potentially sarcastic remark and use that context to help drive our classification decision. Using the sentiment and the subjects of the preceding and succeeding sentences, we may be able to establish that a given sentence is sarcastic.


N-fold Cross-Validation

With the various techniques we will use to detect sarcasm, we need to be able to analyze which work best and why. Cross-validation over each of the classifiers will help validate the results of each individual tool and allow us to fine-tune each one.
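The fold construction itself is simple; a stdlib sketch (in practice we would likely use scikit-learn's cross-validation utilities, and shuffle the data first):

```python
def k_folds(data, k):
    """Split data into k roughly equal folds; each fold serves once as
    the held-out test set while the rest form the training set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```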

Neural Network

The number of individual classifiers we plan to explore is quite large, so in order to appropriately balance how much weight each classifier should receive, beyond basic hard-coded cross-validation, we will create a neural network using the scikit-learn library's MLPClassifier. It will take the outputs of all the individual classifiers, integrate them into a fully connected network, and adjust the weights within the network to give us optimal results in as many circumstances as possible.
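The wiring could look like the sketch below. The feature matrix is random placeholder data standing in for our six classifiers' normalized outputs, and the labels are synthetic, purely to show the shape of the ensemble:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder input: each row stands for the [0, 1] scores that six
# individual classifiers would emit for one message; labels are a
# synthetic stand-in for sarcastic (1) / not sarcastic (0).
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 0] > 0.5).astype(int)

# A small fully connected network that learns how much weight to give
# each classifier's output.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)
print(round(model.score(X, y), 2))
```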


References

[1] D. Bamman and N. A. Smith, "Contextualized Sarcasm Detection on Twitter," in International Conference on Web and Social Media, 2015.

[2] D. Davidov, O. Tsur, and A. Rappoport, "Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon," in Proceedings of the Fourteenth Conference on Computational Natural Language Learning, 2010, pp. 107-116.