design presentation sarcasm detection: jeff stolzenberg...

of 41/41
Sarcasm detection: Design presentation The Wei The Truth and the Light Jesse Feinman James Kasakyan Jeff Stolzenberg

Post on 21-Jul-2020

1 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Sarcasm detection:Design presentationThe Wei The Truth and the LightJesse FeinmanJames KasakyanJeff Stolzenberg

  • Overview

    ● Literature review○ Contextualized Sarcasm Detection on Twitter. Bamman, Smith. Carnegie Mellon University. 2015.○ Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon. Davidov et. al. Hebrew University.

    2010.

    ● Our approach

  • Contextualized Sarcasm Detection on Twitter

  • Dataset

  • Dataset

    9,767 self-labeled (#sarcastic) tweets 9,767 non self-labeled

  • Features: four classes

    Tweet features

    Word unigrams and bigrams

    Capitalization features

    Sentiment scores

  • Features: four classes

    Author features

    Most common author terms

    Most popular author topics

    Author historical sentiment

  • Features: four classes

    Audience features

    Author/audience topic overlap

    Historical communication

  • Features: four classes

    Environment features

    Unigram features of original message

  • Model: Binary logistic regression

  • Results

  • Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon

    •Experiment conducted by three PhD students from the Hebrew University of Jerusalem.

    •Goal: To create a classification algorithm for detecting sarcasm by looking for patterns.

    •Utilized a Semi-supervised framework for automatic identification of sarcastic sentences.•Performed the experiment on two very different data sets.

  • Datasets

    ● Twitter Dataset:○ 5.9 million tweets○ Tweets are 140 characters or fewer○ Tweets can contain urls, references to other tweeters (@) or

    hashtags #○ Slang, abbreviations, and emoticons are common○ Average of 14.2 words per tweet○ 18.9% include a url, 35.3% contain @○ 6.9% contain one or more hashtags

  • Datasets

    ● Amazon Dataset:○ This was made up of 66,000 reviews of 120 products○ 953 characters on average○ Usually structured and grammatical○ Have fields including writer, date, rating, and summary○ Amazon reviews have a great deal of context compared to

    tweets

  • Classification Algorithm

    ● Utilized a semi-supervised learning algorithm ○ Some of the training examples are labeled

    ● Three human annotators were used to create a small seed of labeled input data.

    ● A discrete score of 1-5 was assigned to each sentence.○ 5 – definitely sarcastic○ 1 - a clear absence of sarcasm

  • Classification Algorithm

    ● Given the labeled sentences, they extracted a set of features to be used in feature vectors.

    ● Two basic feature types were utilized: syntactic and pattern based features.

    ● Feature vectors were then constructed for each of the labeled examples in the training set.

    ● They were then used to build a classifier model and assign scores to unlabeled examples.

  • Data Preprocessing ● Specific information was replaced with general tags to facilitate pattern

    matching.○ ‘[PRODUCT]’,’[COMPANY]’,’[TITLE]’○ ‘[AUTHOR]’, ‘[USER]’,[LINK], and ‘[HASHTAG]’○ All HTML tags are removed

  • Pattern Extraction & Selection

    ● Classified words into:○ High Frequency Words(HFW) (ex. The, and, a, you, who)

    frequency > 1000 words/million

    ○ Content words (CW)frequency < 100 words/million

    ● Pattern : Ordered sequence of 2-6 HFWs and 1-6 CWs

  • Pattern Matching

    •Once patterns are selected, asingle entry is constructed in thefeature vectors.•For each sentence a featurevalue is calculated for eachpattern as seen in the table:

  • Punctuation-based Features

    ● 1. Sentence length in words.● 2. Number of “!” characters in the sentence .● 3. Number of “?” characters in the sentence.● 4. Number of quotes in the sentence.● 5. Number of capitalized/all capitals words in the sentence.

  • Creating a Training set of Vectors

    •Each Vector in the Training Set is a Multi-Dimensional Vector

    •Each Dimension is acalculated value according to the features

  • Creating a Vector from a New Sentence

    ● In order to assign a score to new examples in the test set a k-nearest neighbor (kNN)-like strategy was used.

    ● The score for a new instance is the weighted average of the k nearest training set vectors, measured using Euclidean distance

  • Results

    ● On average, The SASI algorithm achieved a precision of 77% and a recall of 83.1%.

    ● The researchers found that the use excessive exclamation marks and capital letters were moderately useful sarcasm indicators.

    ● According to the study the three most sarcastically reviewed items on Amazon were Sony noise cancelling earphones, Dan Brown’s Da Vinci Code, and Amazon’s own Kindle e-reader.

  • Our Approach

    ● Machine Learning with: ○ Suffix○ Vocabulary○ Outside information○ Sentiment○ Partial Sentiment○ Context

    ● Neural Network● Deeper learning on previous traits

    ○ And more

  • N-Gram Suffixes

    ● Smart-est● Smart-er● Smart

    ● Suffixes may be common in sarcastic remarks that are also hyperbole

    ● Frequency analysis○ nltk.FreqDist([str1[-4:], str2[-4:], … , strN[-4:]])

  • N-Gram Words

    ● Antonyms of very negative words○ Best/Worst○ Smart/Stupid

    ● Sarcasm may make more use of these antonyms than other text● N-Gram

    ○ Smartest Person (Bigram)○ Greatest Person Ever (Trigram)

    ● Frequency analysis○ str.split()

    ■ ListOfNTuples.append(“ “.join(str[:n]))■ str = str[n:]

    ○ nltk.FreqDist(ListOfTuples)

  • Term Frequency Inverse Document Frequency

    ● Looking for words which occur very rarely but when they do occur are strong indicators of sarcasm or seriousness

    ● Frequency analysis from n-gram analysis○ Divide the frequency score for each n-gram in a phrase by the prevalence rate of that

    phrase in the entire corpus

  • Sentiment

    You’re the smartest person in the world!

    ● Sentiment which is overly positive is possible sarcasm.● Using Minqing Hu and Bing Liu’s sentiment word list

    ○ sentimentWord = count(word in wordList)○ sentiment = sum(sentimentOfWords)/lengthOfPhrase

    ● Scores above a certain threshold (to be determined) will be flagged as potential sarcasm.

  • N-Gram Sentiment - Splitting Sentences

    You must feel great about yourself by beating an innocent personMixed sentiment

    You must feel great about yourself by beating an innocent personPositive Sentiment Negative Sentiment

    Same as before, but analyze parts of the sentence separately then compare the results.

  • N-Gram Sentiment - Splitting on puncuation

    I was going to say something extremely rough to Lorem Ipsum, to its family, and I said to myself, "I can't do it. I just can't do it. I would be inappropriate. I’m too nice." You’re disgusting. Lorem Ipsum's father was with Lee Harvey Oswald prior to Oswald's being, you know, shot.

    Analyze sentiment of statements before and after, if contradicting return a higher score.

  • Capitalization

    ● Performing the previous analysis both in all lowercase and in the case it was originally written

    ● Normalized Capitalization frequency○ Excluding the first character in a sentence and the first letter of proper nouns○ countOfCaps(str)/length(str)

    ● We can do this during sentiment analysis too○ Words which are all caps and indicate sentiment count more than normal-case words

  • Hashtags

    ● Excluding sarcasm/not sarcasm and related hashtags● Analyze hashtag frequency in the same way as other n-grams were

    analyzed○ Individual hashtags○ Combinations of hashtags

    ● Context dependent○ The system would have to be trained on the topic at hand to learn which hashtags may

    indicate seriousness for a given subject

  • Emoji

    ● Emoticons ○ Treat character as it’s unicode/ascii value○ Use it during sentiment analysis○ Include emoticons in the library of positive and negative terms/symbols○ Frequency analysis (treated as a gram during n-gram analysis)

    ● Text Emoji :D○ More difficult due to complexity○ Need to compile a list of existing text emoji○ Include text emoji string in the library of positive and negative terms/symbols○ Frequency analysis (treated as a gram during n-gram analysis)

  • Word length and Vowel-less words

    ● Frequency of long words○ Score words based on syllables○ Set a threshold for what’s considered long○ CountOfLongWords(str)/length(str)

    ● Frequency of vowel-less words○ Score words based on boolean, vowels present/not present○ CountOfVowelLess(str)/length(str)

  • Pattern Matching

    ● Looking for patterns in phrase format○ “My name is [NAME]”○ “Let’s go to the [PLACE]”

    ● Using part of speech tagging to identify generic components○ Names, places, proper nouns, etc○ Replace them with generics [GENERIC]

    ● Treat these replaced terms as n-grams and perform the same analysis● Part of speech replacement

    ○ “My name is Jesse”○ [My:Pronoun, name:Noun, is:Verb, Jesse:Proper Noun]

    ● Treat these replaced terms as n-grams and perform the same analysis

  • Brown Clustering

    ● Cluster words into similar word meaning

    ○ Determined by preceding words○ Represent word as binary string

    ○ Traverse the tree created by the binary strings for similarity

    ● Repeat the process with○ Patterns○ Part of Speech patterns

    water gas coal liquid acid sand carbon steam shalegreat big vast sudden mere sheer gigantic lifelong scantman woman boy girl lawyer doctor guy farmer teacherAmerican Indian European Japanese German Africanpressure temperature permeability density porosity

    0 the10 chased110 dog1110 mouse1111 cat

    10 chased11 dog11 mouse11 cat

  • N-Fold Cross Validation

    ● Split the data into n subsets● Train each component with a different combination of n-x subsets● Validate the performance of each system with the x subsets that weren’t

    used to train the component○ Using train_test_split from sklearn

  • Neural Network

    Take the output of each of the classifiers, normalize it if it isn’t already, and input it into a neural network

    ● Sklearn.neural_network○ MLPClassifier

  • ● Data

    ● Tweets tagged as #sarcasm and related hashtags○ Topic specific hashtag

    ■ Ex. #education○ Non specific tweets○ Cleaning erroneous data

    ■ Tweets that are miscategorized■ Tweets that are too short, are links, pictures, etc.

    ○ Preprocess■ Remove sarcasm and related hashtags■ Replace proper nouns with generics

    ● I’m John -> I’m [NAME]

  • Deliverable

    ● Command Line Interface● GUI● Stretch

    ○ Library for Sarcasm Detection that the user can train or use pre-trained

  • Time Table

    ● Dec○ Background Research

    ● Jan○ Acquire Dataset○ Process Data○ Design Framework

    ■ Test individual Components

    ● Feb-Mar○ Integrate