
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 7.5 CREDITS
STOCKHOLM, SWEDEN 2018

Sarcasm Detection with TensorFlow

LUDVIG PERSSON

JESPER LARSSON

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Sarcasm Detection with TensorFlow

LUDVIG PERSSON, JESPER LARSSON

Programme: Civilingenjör Datateknik (Computer Science and Engineering)
Date: June 6, 2018
Supervisor: Ric Glassey
Examiner: Örjan Ekeberg
Swedish title: Upptäcka Sarkasm med TensorFlow
School of Electrical Engineering and Computer Science


Abstract

Sentiment analysis is the process of letting a computer guess the sentiment of someone towards something based on a text. This can among other things be useful in marketing; for example, if the computer figures out that a certain person likes a certain product, it can present ads for similar products to that person. Sentiment analysis in social media is when the texts analyzed come from a social media context, like comments or posts on Twitter, Facebook, etc. One problematic aspect of these texts is sarcasm. People tend to be sarcastic very often in social media, and since sarcasm can be hard to detect even for a human, this causes problems for the computer. This study was conducted with the intention of investigating how sarcasm detection can be performed on social media texts with the help of machine learning. For this purpose, Google's machine learning framework for Python, TensorFlow, was utilized. The machine learning model created was a deep neural network with two hidden layers containing ten nodes each. As for the input, a dataset of 4692 texts was used with an 80/20 training/testing split. For preprocessing the texts into a form more suitable for TensorFlow, the methods Bag of Words, Bigrams and a naive method here referred to as Char for Char were considered. However, due to time constraints, proper results from the more advanced approaches (Bigrams and Bag of Words) were not achieved. It was at least found that the rather simple approach was better than expected, with results notably better than 50% that would be highly unlikely to achieve through sheer luck.


Sammanfattning

Sentiment analysis is when a computer is given the task of guessing what someone thinks about something based on a text. This can among other things be useful for marketing; for example, if a computer has figured out that a person likes a product, it can show that person ads for similar products. Sentiment analysis in social media is when the texts analyzed come from social media, such as posts and comments from Facebook, Twitter, etc. One problematic aspect of these texts is sarcasm. People tend to be sarcastic often in social media, while sarcasm at the same time can be hard to detect even for a human reading the text. This study was conducted with the intention of investigating how sarcasm detection can be performed on texts from social media with the help of machine learning. For that purpose, Google's machine learning framework for Python, TensorFlow, was used. The machine learning model created with the framework was a deep neural network with two hidden layers consisting of ten nodes each. For input, a dataset of 4692 texts was used with an 80/20 training/testing split. To transform the texts into a form compatible with TensorFlow, the methods Bag of Words, Bigrams, and a naive method here called Char for Char were considered. Unfortunately, a lack of time meant that proper results from the more advanced methods Bag of Words and Bigrams were not achieved. However, the naive method led to results that differ markedly from 50% and that would be extremely unlikely to achieve through sheer luck.

Contents

1 Introduction 1
1.1 Research Question 2
1.2 Scope 2

2 Theory 3
2.1 Sentiment Analysis 3
2.1.1 Sentiments/Opinions 3
2.1.2 Sentiment analysis in social media 4
2.1.3 Earlier work 4
2.2 TensorFlow 6
2.3 Neural Networks 6
2.3.1 Features and preprocessing 6
2.3.2 Char for Char 8
2.3.3 Bag of words 9
2.3.4 Bigrams 9
2.4 Metrics 9

3 Method 11
3.1 Acquiring a dataset 11
3.2 Specifying the Model 13
3.3 Dataset Transformation 13
3.4 Train-test split 13
3.5 Training and evaluating 13

4 Results 15

5 Discussion 17

6 Conclusion 19

Bibliography 20

A Tabular results 22

B Source Code 24
B.1 Char for Char 24
B.1.1 premade_estimator.py 24
B.1.2 iris_data.py 27
B.1.3 pre_proc.py 32
B.1.4 train_test_splitter.py 35
B.2 bigram 38
B.2.1 premade_estimator.py 38
B.2.2 iris_data.py 41
B.2.3 pre_proc.py 45
B.2.4 train_test_splitter.py 50

Chapter 1

Introduction

Sentiment analysis, or sentiment classification, within computer science is the process of letting software guess what the sentiment of the author of a provided text is. The usual way of doing this is to let the program guess whether the sentiment is positive or negative. Sometimes classifying a neutral sentiment alongside the other two is also of interest.

Sentiment analysis has one potential use in giving insight into user preferences, which could be useful in a wide array of applications such as advertising or product rankings. As an example: suppose your company has made a change to a product beloved by its consumers and is worried about the reception this action will have. The consumers will likely vent their feelings on social media, giving you insight into their reaction.

Automating this task with computers would allow one to analyse more reactions, and faster, than a human could. However, the analysis the computer provides needs to have a high accuracy to be a useful tool. Consider that a coin flip would likely have an accuracy of 50% when deciding between two categories, such as positive and negative.

A challenge one faces when improving the accuracy of sentiment analysis is the complexity of human language. Sarcasm, a common way of expressing opinions in social and political discussion [13], has been proposed to be one contributing factor in making sentiment analysis hard to perform [9]. Even humans can often have trouble detecting sarcasm in text, because many important cues signaling sarcasm, like facial expression or tone of voice, are not present in written text. Being sarcastic is, simplified, to say something while meaning something other than



what is explicitly stated. More formally, this report will rely on the following definition of sarcasm from the authors of the dataset that will be used in this text [12]:

A definition of sarcasm.

1. a sharp and often satirical or ironic utterance designed to be humorous, snarky, or mocking.

2. a mode of satirical wit depending for its effect on bitter, caustic, and often ironic language that is often directed against an individual or a situation.

Successfully detecting sarcasm in a text could be used to improve prediction accuracy. When sarcasm is detected in a text, the polarity of the sentiment prediction can be reversed: from positive to negative or vice versa.
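To make this concrete, here is a minimal, hypothetical sketch of such a pipeline in Python; the placeholder models predict_sentiment and is_sarcastic are illustrations only, not methods from this report:

def predict_sentiment(text):
    # Placeholder sentiment model: positive if the text contains "love".
    return "positive" if "love" in text.lower() else "negative"

def is_sarcastic(text):
    # Placeholder sarcasm detector.
    return "yeah, right" in text.lower()

def sarcasm_aware_sentiment(text):
    polarity = predict_sentiment(text)
    if is_sarcastic(text):
        # Reverse the polarity, as described above.
        polarity = "negative" if polarity == "positive" else "positive"
    return polarity

print(sarcasm_aware_sentiment("I love waiting in line. Yeah, right."))  # negative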

1.1 Research Question

The aim of this study is to evaluate whether we can develop a sarcasm detector for the short form text commonly used in social media and internet forums. More concretely:

• Can machine learning be used on a dataset annotated as either sarcastic or not sarcastic and achieve acceptable accuracy in judging whether a text is sarcastic or not?

• If yes, what kind of accuracy is it capable of?

1.2 Scope

This study will only consider a deep neural network model created with Google's machine learning framework TensorFlow. The dataset that will be used is the sarcasm_v2 dataset [12]. It contains 4692 quote-response pairs, labeled as sarcastic or non-sarcastic based on whether the response is sarcastic. Half of the dataset is labeled sarcastic.

Chapter 2

Theory

2.1 Sentiment Analysis

“Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes.” [13]

2.1.1 Sentiments/Opinions

According to A Practical Guide to Sentiment Analysis [3], “Sentiment analysis mainly studies opinions that express or imply positive or negative sentiment”. Further, opinions are defined such that “An opinion is a quadruple, (g, s, h, t), where g is the sentiment target, s is the sentiment of the opinion about the target g, h is the opinion holder (the person or organization who holds the opinion), and t is the time when the opinion is expressed”.

Sentiment Analysis in Social Networks [13] similarly defines the term “opinion” as such: “an opinion is a quintuple, (e_i, a_ij, s_ijkl, h_k, t_l), where e_i is the name of an entity, a_ij is an aspect of e_i, s_ijkl is the sentiment on aspect a_ij of entity e_i, h_k denotes the opinion holder, and t_l is the time when the opinion is expressed by h_k.”

This second book further claims, just like Pozzi et al., that there are some who would call the subject “opinion mining” and some “sentiment analysis” [13]. A sentiment would be “I like the color green”, while an opinion would be “I think that green is a good color”; so although there is a difference, it is rather subtle. The two will therefore be



discussed here as if they were the same subject. Also, while these opinions are a whole lot more complex, this study will only focus on the sentiment itself.

2.1.2 Sentiment analysis in social media

A general description of the problems faced within this area of research: “In fact, social network sentiment analysis, in addition to inheriting a multitude of issues from traditional sentiment analysis and natural language processing, introduces further complexities (short messages, noisy content, metadata such as gender, location, and age) and new sources of information not leveraged in traditional approaches.” [13]

The metadata will not be relevant to this study; however, the way the messages are formatted and the language used when writing them will be. Due to the informal writing featured in social media, the lexical approach, where you consult a lexicon for the sentimental value a word holds, which has done very well for example when analysing movie reviews [15], is no longer as useful an approach without extensive preprocessing [13].

As Yang et al. worded it: “Most existing techniques rely on natural language processing tools to parse and analyze sentences in a review, yet they offer poor accuracy, because the writing in online reviews tends to be less formal than writing in news or journal articles. Many opinion sentences contain grammatical errors and unknown terms that do not exist in dictionaries.” [16] A machine learning approach does not suffer the same problems, since it does not need to have data on every single word it can potentially encounter. What the machine learning model does instead is explained in the TensorFlow and Neural Networks sections.

2.1.3 Earlier work

A fair amount of research has been performed on the subjects of sentiment analysis and sarcasm detection; however, we did not find much research dedicated to the combination of the two. Tables 2.1 and 2.2 show results from sentiment analysis run on regular Twitter input versus sarcastic Twitter input [5]. It can be seen that the success in determining sentiment for the sarcastic tweets is around 50% in the 2014 results. The 2015 results are a bit better for sarcasm detection, but


System        Twitter 2014   Sarcasm 2014
TeamX         70.96          56.50
coooolll      70.14          46.66
RTRGO         69.95          47.09
NRC-Canada    69.85          58.16
TUGAS         69.00          52.87
CISUC_KIS     67.95          55.49
SAIL          67.77          57.26

Table 2.1: Sentiment Analysis Task in F-Measure Terms for Both Regular and Sarcastic Tweets in the 2014 Edition of SemEval.

System        Twitter 2015   Sarcasm 2015
Webis         64.84          53.59
unitn         64.59          55.01
lsislif       64.27          46.00
INESC-ID      64.17          64.91
Splusplus     63.73          60.99

Table 2.2: Best Results in the Sentiment Analysis Task in F-Measure Terms for Both Regular and Sarcastic Tweets in the 2015 Edition of SemEval.

with lower success for the non-sarcastic data.

Another example of earlier results is a study on Greek tweets about the 2015 Greek election [2]. The authors asked the general public to perform the annotation task and ended up with about 4600 annotated tweets. The following results were achieved.

Category       Precision   Recall   f1-score   Test Samples
Non-sarcastic  0.69        0.62     0.65       621
Sarcastic      0.72        0.78     0.75       772
Average/total  0.70        0.71     0.70       1393

Table 2.3: Results of sentiment analysis on tweets from the Greek election.

These results are better; however, the data was annotated by “134 different user sessions”, meaning 134 or fewer different unknown individuals.


2.2 TensorFlow

TensorFlow is a machine learning framework created by Google. It will be utilized to create a deep neural network model, which in turn is going to be used for predicting whether a text is sarcastic or not. TensorFlow provides a concept called Estimators, defined as “a high-level TensorFlow API that greatly simplifies machine learning programming. Estimators encapsulate the following actions:

• training

• evaluation

• prediction

• export for serving” [4]
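As a minimal sketch of this lifecycle (assuming TensorFlow 1.x, the version used in this report, and a toy input function that is not from the report):

import numpy as np
import tensorflow as tf

def input_fn():
    # Toy data: 100 random one-dimensional examples with binary labels.
    features = {"x": np.random.rand(100, 1).astype(np.float32)}
    labels = np.random.randint(0, 2, size=100)
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(10)

estimator = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column("x")],
    hidden_units=[10, 10],  # the same shape as the model in this report
    n_classes=2)

estimator.train(input_fn=input_fn, steps=100)           # training
print(estimator.evaluate(input_fn=input_fn, steps=10))  # evaluation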

2.3 Neural Networks

A neural network can be defined as “A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities” [10]. The first layer is the input layer, which contains the features that the author of the neural network specifies. There is also the output layer, the final layer, which presents the answers the neural network has arrived at when it is done computing. In between these are the so-called hidden layers. The idea is to have several hidden layers, each containing many nodes, which process the data provided to them. Accompanying the data are labels, giving the neural network feedback on whether it was successful in labeling the data. Based on this feedback, values in the nodes of the neural network change, and in this way the neural network can learn to recognize patterns in the data.

2.3.1 Features and preprocessing

When used in TensorFlow, the data will be in the form of a matrix. Every row in this matrix represents one input; in the case of this study,


Figure 2.1: Structure of neural network.


inputs are strings which each contain one message from an online discussion board. Every column in the matrix is one feature. What data a feature encompasses is defined by the programmer, and will vary depending on which method for processing the input is used.

When using strings as input, some method for processing the data must be applied, since TensorFlow only operates on numerical data [6]. This transformation should, if possible, preserve the ordering of the input string, due to the impact it may have on how the string is interpreted. Two strings with the same words can be interpreted as sarcastic or not depending on the word order, e.g. “yeah, right” and “right, yeah”.

There exist multiple ways of transforming strings to numerical data. Some methods used in earlier works are Bag of Words (BOW) and its generalisation, n-grams [3][1]. A naïve approach, here called Char for Char, is also utilized.

2.3.2 Char for Char

This is a method of processing the data where each character in the input string is represented as a feature. The first preprocessing step is to find the longest input string, ranked by the number of characters in the string. The number of features is then set to the length of this input string. Each string is split into single characters, with each character becoming a feature. Finally, the row is padded to the length of the longest input, with whitespace characters making up the remaining features.

Figure 2.2: Illustration of Char for Char.
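A minimal sketch of the transformation, assuming the input is a list of (label, text) pairs (the example strings are illustrative):

def char_for_char(rows):
    # One feature per character; pad with spaces to the longest input.
    max_length = max(len(text) for _, text in rows)
    processed = []
    for label, text in rows:
        features = list(text) + [' '] * (max_length - len(text))
        processed.append([label] + features)
    return processed

for row in char_for_char([(1, "yeah, right"), (0, "right")]):
    print(row)
# [1, 'y', 'e', 'a', 'h', ',', ' ', 'r', 'i', 'g', 'h', 't']
# [0, 'r', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ', ' ']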


Figure 2.3: Illustration of Bag of Words and Bigrams.

2.3.3 Bag of words

Bag of words is a very different method of processing data. This method does not handle characters but entire words. Each unique word in all of the input data combined is considered a feature (column). What is then counted for each input (row) is the number of occurrences of each word in that input. If the word “car” occurs 4 times in an input string, then the “car” feature for that input will contain the number 4.
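A minimal sketch of the counting, assuming whitespace tokenization (the example texts are illustrative):

from collections import Counter

texts = ["the car hit the other car", "the bike"]
# Each unique word across all inputs becomes a feature (column).
vocabulary = sorted({word for text in texts for word in text.split()})

for text in texts:
    counts = Counter(text.split())
    print([counts[word] for word in vocabulary])  # one row per input
# vocabulary: ['bike', 'car', 'hit', 'other', 'the']
# [0, 2, 1, 1, 2]
# [1, 0, 0, 0, 1]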

2.3.4 Bigrams

The bag of words method and the Bigrams method are essentially the same. They are both specialized cases of the N-grams method (where bag of words would be considered “1-grams” or “unigrams”). The N-grams method works the same way as described above for bag of words, but with the important difference that not just every word, but rather each group of N adjacent words found in the input, is considered a feature. If an input for example contained the string “Machine learning is fun”, the features “Machine learning”, “learning is” and “is fun” would be extracted when using a bigram model (which is to say: when N = 2).
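A minimal sketch of N-gram extraction, reproducing the example above for N = 2 and degenerating to bag-of-words tokens for N = 1:

def ngrams(text, n):
    words = text.split()
    # Every group of n adjacent words becomes one feature.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("Machine learning is fun", 2))
# ['Machine learning', 'learning is', 'is fun']
print(ngrams("Machine learning is fun", 1))
# ['Machine', 'learning', 'is', 'fun']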

2.4 Metrics

For measuring performance, accuracy will be used. Accuracy is the number of true positives plus the number of true negatives, divided by the total number of examples. True positives are all the instances classified as positive that are actually positive; true negatives are all the instances classified as negative that are actually negative. Accuracy is then the percentage of instances correctly classified [14].

\[
\text{Accuracy} = \frac{\text{TruePositives} + \text{TrueNegatives}}{\text{TotalNumberOfExamples}}
\]
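A direct translation of the formula into Python, assuming binary labels where 1 means sarcastic and 0 means not sarcastic:

def accuracy(predictions, labels):
    true_positives = sum(p == 1 and l == 1 for p, l in zip(predictions, labels))
    true_negatives = sum(p == 0 and l == 0 for p, l in zip(predictions, labels))
    return (true_positives + true_negatives) / len(labels)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75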

Chapter 3

Method

The approach in this study is divided into four parts: firstly acquiring a dataset, secondly choosing a machine learning model, thirdly transforming that dataset into a format suitable for TensorFlow, and lastly using the transformed data to train and evaluate the machine learning model. This chapter will describe these processes.

3.1 Acquiring a dataset

Earlier work has compiled different datasets with text annotated as sarcastic or not. We found two: the Sarcasm Corpus V2 [12] and the Self-Annotated Reddit Corpus (SARC) [8].

We compared the two datasets mainly based on their quality and ease of use. The 4692 examples in the Sarcasm Corpus V2 were annotated by crowdsourcing, meaning independent people had gone through and annotated every example. This dataset was easily available as a comma-separated values file. SARC was, as the name hints, annotated by the authors of the comments themselves. (There exists a culture on Reddit of marking one's comment with '/s' to indicate that it is sarcastic.) The annotated comments are probably sarcastic, but we cannot be sure whether the unmarked ones are sarcastic or not. This diminishes the quality of the dataset. A normally positive trait of the dataset is that it consists of circa 1.3 million annotated comments. This unfortunately makes SARC somewhat unwieldy considering the time and computing power at our disposal, so we decided to use the Sarcasm Corpus V2 for this report, since we deemed it to be of higher quality and quite easy to work with.



Corpus: GEN, Label: sarc, ID: GEN_sarc_0000
Quote Text: First off, That's grade A USDA approved Liberalism in a nutshell.
Response Text: Therefore you accept that the Republican party almost as a whole is "grade A USDA approved Liberalism." About time you did.

Corpus: GEN, Label: sarc, ID: GEN_sarc_0001
Quote Text: watch it. Now you're using my lines. Poet has always been an easy target, I will agree. ;)
Response Text: More chattering from the peanut gallery? Haven't gotten the memo, you're no longer a player? Honestly....clamoring for attention is so low budget. No shame.

Corpus: RQ, Label: notsarc, ID: RQ_notsarc_0397
Quote Text: This pretty much sums up people like Penfold. This difinitivly shows that he believes your right to own firearms should be taken away. Thank heaven our founding fathers sought to protect us from the likes of him and enumerated our right to keep and bear arms. Happy New Year
Response Text: Don't be so faithful in our laws. Remember prohibition? Of course, I myself highly doubt that Obama will manage to ban firearms, but mark my words, he will do anything in his power to restrict them.

Table 3.1: Some examples from the Sarcasm Corpus V2


3.2 Specifying the Model

There exist many excellent machine learning frameworks. We chose to use Google's TensorFlow framework for Python for this study. TensorFlow offers good documentation and tools, which made it easy for us to get started with it and create our model. We based our model on the one from the documentation's Get Started tutorial [7], creating a Deep Neural Network [11] with two hidden layers, each with 10 nodes. Refer to B.1.1, B.1.2, B.2.1 and B.2.2 in the appendix.
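Condensed from the full listing in appendix B.1.1, the model specification amounts to the following (the feature columns shown are the Char for Char variant):

import tensorflow as tf  # TensorFlow 1.x
import iris_data  # the data-loading module listed in appendix B.1.2

(train_x, train_y), (test_x, test_y) = iris_data.load_data()

# One categorical column per feature, backed by the vocabulary file.
feature_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_file(
            key=key, vocabulary_file=iris_data.VOCABULARY_PATH))
    for key in train_x.keys()]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],  # two hidden layers of 10 nodes each
    n_classes=2,            # sarcastic vs. not sarcastic
    model_dir='models/cfc')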

3.3 Dataset Transformation

Before being able to start training our model, we had to transform our dataset into a format suitable for TensorFlow. In our research, N-grams or one of its specialized forms, i.e. bag of words/unigrams and bigrams, were often used [3]. We decided to try these preprocessing methods as well as the naïve Char for Char method. We implemented the preprocessing methods in Python (see B.1.3 and B.2.3 for more details) and created our transformed datasets.

3.4 Train-test split

The last step before we could start training our model was splitting our data into one training and one test set. We decided to dedicate 80% of our dataset (3754 examples) to training and the remaining 20% (938 examples) to testing. We made sure to keep the ratio of sarcastic to non-sarcastic comments the same in the test and training subdatasets as in the original dataset, i.e. 50/50, to eliminate unintentional biasing of the data. Otherwise, the examples were randomly assigned to the train or test subdataset. For more information, see B.1.4 and B.2.4.
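A minimal sketch of such a stratified split, assuming rows of (label, text) pairs; with 2346 examples per class it yields the 3754/938 split used here:

import random

def stratified_split(rows, train_fraction=0.8):
    train, test = [], []
    for label in (0, 1):  # split each class separately to keep the ratio
        subset = [row for row in rows if row[0] == label]
        random.shuffle(subset)
        cut = round(len(subset) * train_fraction)
        train.extend(subset[:cut])
        test.extend(subset[cut:])
    return train, test

rows = [(i % 2, "text %d" % i) for i in range(4692)]
train, test = stratified_split(rows)
print(len(train), len(test))  # 3754 938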

3.5 Training and evaluating

With the dataset prepared for our chosen model, we started training and evaluating it. TensorFlow contains simple tools for training a model created with it for some number of steps (updates of the model's weights), with each step operating on a batch of examples. It contains similar tools for evaluating the accuracy of the trained model. We continuously trained for 1000 steps at a time, followed by an evaluation. Each step used a batch of 100 examples. We output the result of each evaluation, i.e. the accuracy, to standard output, and when we ran the program we piped it to a file. We let the model train and evaluate over night.
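Continuing from the model sketch in section 3.2, the training loop condensed from appendix B.1.1 looks like this (batch size 100, 1000 steps per round):

while True:
    classifier.train(
        input_fn=lambda: iris_data.train_input_fn(train_x, train_y, 100),
        steps=1000)
    eval_result = classifier.evaluate(
        input_fn=lambda: iris_data.eval_input_fn(test_x, test_y, 100))
    print('{accuracy:f}'.format(**eval_result))  # piped to a file when run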

Chapter 4

Results

The result from training and evaluating the model on the dataset preprocessed with Char for Char was an accuracy of circa 57%. Figure 4.1 shows how the accuracy changed over the course of training. The exact data is available in table A.1 in the appendix.

The accuracy of the model trained on the dataset preprocessed with Char for Char turned out to be quite constant with respect to the amount of training. The accuracy hovered around an average of approximately 57%. The probability of getting this accuracy or better by flipping a coin for each of the 100 examples in a batch (50% probability with 100 trials) would be approximately 9.7%.
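The 9.7% figure can be checked directly against the binomial distribution (assuming SciPy is available; this snippet is ours, not from the report):

from scipy.stats import binom

# P(X >= 57) for X ~ Binomial(100, 0.5): the chance that 100 fair coin
# flips match 57 or more of 100 binary labels.
print(1 - binom.cdf(56, 100, 0.5))  # ~0.097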



Figure 4.1: Chart of accuracy at s steps.

When training the model with the dataset preprocessed with the Bag of Words method, not a single training+evaluation cycle completed in one night. It was then decided that this approach took too long for reliable results to be achieved. This was surprising, given that the Bag of Words method has been used successfully before [3]. The bigram preprocessing method was also discarded at this point, due to the output of the bigram preprocessing growing even quicker than that of the Bag of Words method. The dataset processed with the Bag of Words method became 527 MB (an increase of 200 times compared to the original dataset) and seems correct upon inspection. The Char for Char dataset became 23 MB after preprocessing (an increase of almost 9 times) in comparison.

Chapter 5

Discussion

This chapter will begin by discussing whether our results are reasonable. It will then continue by discussing them in the broader context of sentiment analysis.

With simple methods and tools we have developed a TensorFlow model capable of correctly classifying 57% of the examples in the test subdataset when preprocessing it with the Char for Char method. While "only" beating the expected accuracy of coin flipping by 7 percentage points, the probability of getting such a result by chance is approximately 9.7%, as shown in chapter 4: Results. This insight, together with the fact that the ratio of sarcastic to non-sarcastic examples is the same (50/50) in all of the datasets, suggests that the model is doing something more complex than flipping a coin or always guessing the same category (always sarcastic or always non-sarcastic). It is hard to compare our results with what others have achieved, because earlier work used the F1-score while we used accuracy as our metric. This was a mistake in the design of the experiment. If redone, the experiment should measure the F1-score, allowing the results to be compared to earlier work.

That we were not able to achieve any results using the Bag of Words preprocessing method is strange, considering its frequent appearance in earlier work. This indicates that our implementation probably is faulty in some way. The fact that the data became an order of magnitude larger when processed with the Bag of Words method compared to the Char for Char method, while seeming correct, points to it perhaps not dealing with the complexities of the data in a sufficient manner. Further techniques to simplify the data, such as stripping the text of punctuation, correcting misspellings, collapsing inflections to the base



word, etc., could have been employed and might have improved the results. These techniques might also be beneficial for the Char for Char method.

One could improve the experiment and the model's results in it, but how would that translate to the wider context of sentiment analysis? This report has also not determined how the created model would generalise to different datasets, such as general social media. Would the model be able to keep its accuracy in a real-world test, compared to the shielded environment of this experiment? Would it even matter in a real-world application? Would the preprocessing methods themselves generalise? Further, it would be interesting to see how a sarcasm detector like the one presented in this report would affect sentiment analysis methods. Would they benefit from being able to treat sarcastic comments separately? If yes, what level of accuracy is necessary for it to improve the result of the sentiment analysis? These are some points that would be interesting to research in the future.

For the preprocessing methods, there are some characteristics that might affect how they generalise. The Bag of Words method works on the specific vocabulary found in the dataset it trained on. If a new word were found in the real-world data, the model would not be able to handle it. It is conceivable that it would be acceptable to discard such words or change them to a synonym found in the model's vocabulary. The Char for Char method instead works on the alphabet, allowing it to gracefully accept new words. Of course, if a new character were to appear, this method would face the same problem as the Bag of Words method does when presented with a new word. It does, however, seem more rare that new characters are introduced into the language than new words. Given this, the Char for Char method might be more resilient in a real-world scenario.

Chapter 6

Conclusion

Given the results of this report, we conclude that it is possible to use machine learning, TensorFlow specifically, to detect sarcasm in the short free-form text found in social media. Our model, created with naïve methods and tools, achieves an accuracy of 57%. We speculate that this result could be improved upon with more extensive preprocessing or more traditional preprocessing methods.


Bibliography

[1] Basant Agarwal and Namita Mittal. “Introduction”. In: Prominent Feature Extraction for Sentiment Analysis. Springer, 2016, pp. 1–4.

[2] Despoina Antonakaki et al. “Social media analysis during political turbulence”. In: PloS one 12.10 (2017), e0186836.

[3] Erik Cambria et al. A practical guide to sentiment analysis. Vol. 5. Springer, 2017.

[4] Estimators | TensorFlow. URL: https://www.tensorflow.org/programmers_guide/estimators (visited on ).

[5] DI Hernández Farias and Paolo Rosso. “Irony, sarcasm, and sentiment analysis”. In: Sentiment Analysis in Social Networks. Elsevier, 2017, pp. 113–128.

[6] Feature Columns | TensorFlow. URL: https://www.tensorflow.org/get_started/feature_columns (visited on ).

[7] Get Started with Graph Execution | TensorFlow. URL: https://www.tensorflow.org/get_started/get_started_for_beginners (visited on ).

[8] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. “A large self-annotated corpus for sarcasm”. In: arXiv preprint arXiv:1704.05579 (2017).

[9] Edwin Lunando and Ayu Purwarianti. “Indonesian social media sentiment analysis with sarcasm detection”. In: Advanced Computer Science and Information Systems (ICACSIS), 2013 International Conference on. IEEE. 2013, pp. 195–198.

[10] Machine Learning Glossary | Google Developers. URL: https://developers.google.com/machine-learning/glossary/#neural_network (visited on ).

[11] Machine Learning Glossary | Google Developers. URL: https://developers.google.com/machine-learning/glossary/#deep_model (visited on ).

[12] Shereen Oraby et al. “Creating and characterizing a diverse corpus of sarcasm in dialogue”. In: arXiv preprint arXiv:1709.05404 (2017).

[13] Federico Alberto Pozzi et al. Sentiment analysis in social networks. Morgan Kaufmann, 2016.

[14] Claude Sammut and Geoffrey I Webb. Encyclopedia of machine learning. Springer Science & Business Media, 2011.

[15] Tun Thura Thet, Jin-Cheon Na, and Christopher SG Khoo. “Aspect-based sentiment analysis of movie reviews on discussion boards”. In: Journal of information science 36.6 (2010), pp. 823–848.

[16] Christopher C Yang et al. “Understanding online consumer review opinions with sentiment analysis using machine learning”. In: Pacific Asia Journal of the Association for Information Systems 2.3 (2010).

Appendix A

Tabular results

After # of steps   Char for Char
1000    0.573561
2000    0.568230
3000    0.569296
4000    0.567164
5000    0.568230
6000    0.568230
7000    0.570362
8000    0.570362
9000    0.571429
10000   0.572495
11000   0.572495
12000   0.570362
13000   0.571429
14000   0.570362
15000   0.570362
16000   0.570362
17000   0.571429
18000   0.574627
19000   0.574627
20000   0.572495
21000   0.574627
22000   0.575693
23000   0.575693
24000   0.575693
25000   0.575693
26000   0.575693
27000   0.575693
28000   0.575693
29000   0.575693
30000   0.575693
31000   0.575693
32000   0.575693
33000   0.574627
34000   0.574627
35000   0.573561
36000   0.573561
37000   0.573561
38000   0.573561
39000   0.573561
40000   0.572495
41000   0.572495
42000   0.572495

Table A.1: Accuracy of ML model after n steps.

Appendix B

Source Code

B.1 Char for Char

B.1.1 premade_estimator.py

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An Example of a DNN Classifier for the Iris dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import tensorflow as tf

import iris_data

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=100, type=int,
                    help='batch size')
parser.add_argument('--train_steps', default=1000, type=int,
                    help='number of training steps')
parser.add_argument('--model_dir', default='models/cfc', type=str,
                    help='directory to save model checkpoints')


def main(argv):
    args = parser.parse_args(argv[1:])

    # Fetch the data
    (train_x, train_y), (test_x, test_y) = iris_data.load_data()

    # Feature columns describe how to use the input.
    my_feature_columns = []
    for key in train_x.keys():
        my_feature_columns.append(
            tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_file(
                    key=key,
                    vocabulary_file=iris_data.VOCABULARY_PATH)))

    # Build 2 hidden layer DNN with 10, 10 units respectively.
    classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 2 classes.
        n_classes=2,
        model_dir=args.model_dir)

    while True:
        # Train the Model.
        classifier.train(
            input_fn=lambda: iris_data.train_input_fn(train_x, train_y,
                                                      args.batch_size),
            steps=args.train_steps)

        # Evaluate the model.
        eval_result = classifier.evaluate(
            input_fn=lambda: iris_data.eval_input_fn(test_x, test_y,
                                                     args.batch_size))

        print('{accuracy:f}\n'.format(**eval_result))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.FATAL)
    tf.app.run(main)

B.1.2 iris_data.py

import pandas as pd
import tensorflow as tf
import csv

TRAIN_PATH = "../res/train.csv"
TEST_PATH = "../res/test.csv"
HEADERS_PATH = "../res/headers.csv"
VOCABULARY_PATH = "../res/vocabulary.txt"

SPECIES = ['Setosa', 'Versicolor', 'Virginica']


def load_column_names():
    headers = []
    with open(HEADERS_PATH, 'r') as headers_file:
        reader = csv.reader(headers_file)
        for row in reader:
            headers.append(row)
    return headers[0]


def load_data(y_name='label'):
    """Returns the iris dataset as (train_x, train_y), (test_x, test_y)."""
    col_names = load_column_names()

    train = pd.read_csv(TRAIN_PATH, names=col_names, header=0,
                        quoting=csv.QUOTE_ALL)
    train.pop("1311")  # TODO do not pop col, but got nan in it
    train_x, train_y = train, train.pop(y_name)

    test = pd.read_csv(TEST_PATH, names=col_names, header=0,
                       quoting=csv.QUOTE_ALL)
    test.pop("1311")  # TODO do not pop col, but got nan in it
    test_x, test_y = test, test.pop(y_name)

    return (train_x, train_y), (test_x, test_y)


def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset


def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset


# The remainder of this file contains a simple example of a csv parser,
# implemented using the `Dataset` class.

# `tf.parse_csv` sets the types of the outputs to match the examples given in
# the `record_defaults` argument.
CSV_TYPES = [[0.0], [0.0], [0.0], [0.0], [0]]


def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, record_defaults=CSV_TYPES)

    # Pack the result into a dictionary
    features = dict(zip(load_column_names(), fields))

    # Separate the label from the features
    label = features.pop('Species')

    return features, label


def csv_input_fn(csv_path, batch_size):
    # Create a dataset containing the text lines.
    dataset = tf.data.TextLineDataset(csv_path).skip(1)

    # Parse each line.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset

B.1.3 pre_proc.py

import csv

TXT = '../res/vocabulary.txt'
PROCESSED_DATA_CSV = '../res/processed_data.csv'
RES_HEADERS_CSV = '../res/headers.csv'
SOURCE = "../res/sarcasm_v2.csv"

max_length = 0
skipped_headings = False
with open(SOURCE, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    for row in sarcasm_reader:
        if not skipped_headings:
            skipped_headings = True
            continue
        text = row[4]
        if len(text) > max_length:
            max_length = len(text)

skipped_headings = False
processed_rows = []
with open(SOURCE, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    for row in sarcasm_reader:
        if not skipped_headings:
            skipped_headings = True
            continue
        text = row[4]
        label = row[1]
        text_len = len(text)
        processed_row = []
        if label == "notsarc":
            processed_row.append(0)
        else:
            processed_row.append(1)
        for char in text:
            processed_row.append(char)
        while len(processed_row) < max_length:
            processed_row.append(' ')
        processed_rows.append(processed_row)

headers = []
headers.append("label")
for i in range(max_length):
    headers.append(i)

with open(RES_HEADERS_CSV, 'w') as headers_file:
    writer = csv.writer(headers_file)
    writer.writerow(headers)

with open(PROCESSED_DATA_CSV, 'w') as save_file:
    writer = csv.writer(save_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)
    for processed_row in processed_rows:
        writer.writerow(processed_row)

voc = {}
for row in processed_rows:
    for i in range(1, len(row)):
        voc[row[i]] = 0

with open(TXT, 'w') as voc_file:
    keys = list(voc.keys())
    keys.sort()
    for key in keys:
        voc_file.write(key)
        voc_file.write("\n")

B.1.4 train_test_splitter.py

import csv
import random

RES_TEST_CSV = '../res/test.csv'
RES_TRAIN_CSV = '../res/train.csv'
PROCESSED_DATA_CSV = '../res/processed_data.csv'

with open(PROCESSED_DATA_CSV, 'r') as proc_data_file:
    reader = csv.reader(proc_data_file, delimiter=',',
                        quoting=csv.QUOTE_ALL)
    data = []
    for row in reader:
        data.append(row)

headers = data.pop(0)

nonsarc_data = list(filter(lambda x: x[0] == '0', data))
sarc_data = list(filter(lambda x: x[0] == '1', data))

amount_of_nonsarc_train = round(len(nonsarc_data) * 0.8)
amount_of_nonsarc_test = len(nonsarc_data) - amount_of_nonsarc_train
amount_of_sarc_train = round(len(sarc_data) * 0.8)
amount_of_sarc_test = len(sarc_data) - amount_of_sarc_train

with open(RES_TRAIN_CSV, 'w') as train_file:
    writer = csv.writer(train_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_train:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_train:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1

with open(RES_TEST_CSV, 'w') as test_file:
    writer = csv.writer(test_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_test:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_test:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1

B.2 bigram

B.2.1 premade_estimator.py

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An Example of a DNN Classifier for the Iris dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import tensorflow as tf

import iris_data

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=100, type=int,
                    help='batch size')
parser.add_argument('--train_steps', default=1000, type=int,
                    help='number of training steps')
parser.add_argument('--model_dir', default='models/bow', type=str,
                    help='directory to save model checkpoints')


def main(argv):
    args = parser.parse_args(argv[1:])

    # Fetch the data
    (train_x, train_y), (test_x, test_y) = iris_data.load_data()

    # Feature columns describe how to use the input.
    my_feature_columns = []
    for key in train_x.keys():
        my_feature_columns.append(
            tf.feature_column.numeric_column(key=key))

    # Build 2 hidden layer DNN with 10, 10 units respectively.
    classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 2 classes.
        n_classes=2,
        model_dir=args.model_dir)

    while True:
        # Train the Model.
        classifier.train(
            input_fn=lambda: iris_data.train_input_fn(train_x, train_y,
                                                      args.batch_size),
            steps=args.train_steps)

        # Evaluate the model.
        eval_result = classifier.evaluate(
            input_fn=lambda: iris_data.eval_input_fn(test_x, test_y,
                                                     args.batch_size))

        print('{accuracy:f}\n'.format(**eval_result))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    tf.app.run(main)

B.2.2 iris_data.py

import pandas as pd
import tensorflow as tf
import csv

TRAIN_PATH = "../res/train.csv"
TEST_PATH = "../res/test.csv"
HEADERS_PATH = "../res/headers.csv"
VOCABULARY_PATH = "../res/vocabulary.txt"


def load_column_names():
    headers = []
    with open(HEADERS_PATH, 'r') as headers_file:
        reader = csv.reader(headers_file)
        for row in reader:
            headers.append(row)
    return headers[0]


def load_data(y_name='label'):
    """Returns the iris dataset as (train_x, train_y), (test_x, test_y)."""
    col_names = load_column_names()

    train = pd.read_csv(TRAIN_PATH, names=col_names, header=0,
                        quoting=csv.QUOTE_ALL)
    train_x, train_y = train, train.pop(y_name)

    test = pd.read_csv(TEST_PATH, names=col_names, header=0,
                       quoting=csv.QUOTE_ALL)
    test_x, test_y = test, test.pop(y_name)

    return (train_x, train_y), (test_x, test_y)


def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset


def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset


# The remainder of this file contains a simple example of a csv parser,
# implemented using the `Dataset` class.

# `tf.parse_csv` sets the types of the outputs to match the examples given in
# the `record_defaults` argument.
CSV_TYPES = [[0.0], [0.0], [0.0], [0.0], [0]]


def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, record_defaults=CSV_TYPES)

    # Pack the result into a dictionary
    features = dict(zip(load_column_names(), fields))

    # Separate the label from the features
    label = features.pop('Species')

    return features, label


def csv_input_fn(csv_path, batch_size):
    # Create a dataset containing the text lines.
    dataset = tf.data.TextLineDataset(csv_path).skip(1)

    # Parse each line.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset

B.2.3 pre_proc.py

import csv
import sys

PROCESSED_DATA_CSV = '../res/processed_data.csv'
VOCABULARY_TXT = '../res/vocabulary.txt'
RES_HEADERS_CSV = '../res/headers.csv'
SARCASM_V2_CSV = '../res/sarcasm_v2.csv'
RES_SARCASM_V2_CSV = '../res/sarcasm_v2.csv'

skipped_headings = False
with open(RES_SARCASM_V2_CSV, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    text_accumulator = ''
    for row in sarcasm_reader:
        if not skipped_headings:
            skipped_headings = True
            continue
        text_accumulator += ' ' + row[4]

vocab = {word for word in text_accumulator.split()}

print("Done building vocab")
print(len(vocab), "unique words")

skipped_headings = False
processed_rows = []
with open(SARCASM_V2_CSV, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    i = 0
    for row in sarcasm_reader:
        i += 1
        sys.stdout.write('\rrow {0:05d}'.format(i))
        sys.stdout.flush()
        if not skipped_headings:
            skipped_headings = True
            continue
        text = row[4]
        label = row[1]
        processed_row = []
        if label == "notsarc":
            processed_row.append(0)
        else:
            processed_row.append(1)
        local_vocab = {}
        words = text.split()
        for word in words:
            if word not in local_vocab:
                local_vocab[word] = 1
            else:
                local_vocab[word] = local_vocab[word] + 1
        for word in vocab:
            if word not in local_vocab:
                local_vocab[word] = 0
        for word in sorted(local_vocab.keys()):
            processed_row.append(local_vocab[word])
        processed_rows.append(processed_row)

print("\nDone counting occurrences")

headers = ["label"]
for word in sorted(vocab):
    headers.append(word)

with open('../res/real_headers.csv', 'w') as headers_file:
    writer = csv.writer(headers_file)
    writer.writerow(headers)

headers = ["label"]
i = 0
for word in sorted(vocab):
    headers.append(i)
    i += 1

with open(RES_HEADERS_CSV, 'w') as headers_file:
    writer = csv.writer(headers_file)
    writer.writerow(headers)

print("Headers saved")

with open(PROCESSED_DATA_CSV, 'w') as save_file:
    writer = csv.writer(save_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)
    for processed_row in processed_rows:
        writer.writerow(processed_row)

print("Data saved")

with open(VOCABULARY_TXT, 'w') as voc_file:
    keys = list(vocab)
    keys.sort()
    for key in keys:
        voc_file.write(str(key))
        voc_file.write("\n")

B.2.4 train_test_splitter.py

import csv
import random

RES_TEST_CSV = '../res/test.csv'
RES_TRAIN_CSV = '../res/train.csv'
PROCESSED_DATA_CSV = '../res/processed_data.csv'

with open(PROCESSED_DATA_CSV, 'r') as proc_data_file:
    reader = csv.reader(proc_data_file, delimiter=',',
                        quoting=csv.QUOTE_ALL)
    data = []
    for row in reader:
        data.append(row)

headers = data.pop(0)

nonsarc_data = list(filter(lambda x: x[0] == '0', data))
sarc_data = list(filter(lambda x: x[0] == '1', data))

amount_of_nonsarc_train = round(len(nonsarc_data) * 0.8)
amount_of_nonsarc_test = len(nonsarc_data) - amount_of_nonsarc_train
amount_of_sarc_train = round(len(sarc_data) * 0.8)
amount_of_sarc_test = len(sarc_data) - amount_of_sarc_train

with open(RES_TRAIN_CSV, 'w') as train_file:
    writer = csv.writer(train_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_train:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_train:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1

with open(RES_TEST_CSV, 'w') as test_file:
    writer = csv.writer(test_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_test:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_test:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1
