DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 7.5 CREDITS
STOCKHOLM, SWEDEN 2018

Sarcasm Detection with TensorFlow

LUDVIG PERSSON
JESPER LARSSON

KTH School of Electrical Engineering and Computer Science
Degree programme: Civilingenjör Datateknik
Date: June 6, 2018
Supervisor: Ric Glassey
Examiner: Örjan Ekeberg
Swedish title: Upptäcka Sarkasm med TensorFlow
Abstract
Sentiment analysis is the process of letting a computer guess the sentiment of someone towards something based on a text. This can, among other things, be useful in marketing: for example, if the computer figures out that a certain person likes a certain product, it can present ads for similar products to that person. Sentiment analysis in social media is when the texts analyzed come from a social media context, such as comments or posts on Twitter, Facebook, etc. One problematic aspect of these texts is sarcasm. People tend to be sarcastic very often in social media, and since sarcasm can be hard to detect even for a human, it causes problems for the computer. This study was conducted with the intention of investigating how sarcasm detection can be performed on social media texts with the help of machine learning. For this purpose Google's machine learning framework for Python, TensorFlow, was utilized. The machine learning model created was a deep neural network with two hidden layers containing ten nodes each. As input, a dataset of 4692 texts was used with an 80/20 training/testing split. For preprocessing the texts into a form more suitable for TensorFlow, the methods Bag of Words, Bigrams, and a naive method here referred to as Char for Char were considered. However, due to time constraints, proper results from the more advanced approaches (Bigrams and Bag of Words) were not achieved. It was at least found that the rather simple approach performed better than expected, with results notably better than 50% that would be highly unlikely to achieve through sheer luck.
Sammanfattning
Sentiment analysis is when a computer is given the task of guessing what someone thinks about something based on a text. This can, among other things, be useful for marketing: for example, when a computer has figured out that a person likes a product, it can show that person ads for similar products. Sentiment analysis in social media is when the texts analyzed come from social media, such as posts and comments from Facebook, Twitter, etc. One problematic aspect of these texts is sarcasm. People tend to be sarcastic often in social media, while sarcasm can be hard to detect even for a human reading the text. This study was conducted with the intention of investigating how sarcasm detection can be performed on texts from social media with the help of machine learning. For that purpose Google's machine learning framework for Python, TensorFlow, was used. The machine learning model created with the framework was a deep neural network with two hidden layers consisting of ten nodes each. As input, a dataset of 4692 texts was used with an 80/20 training/testing split. To transform the texts into a form compatible with TensorFlow, the methods Bag of Words, Bigrams, and a naive method here called Char for Char were considered. Unfortunately, lack of time meant that proper results from the more advanced methods, Bag of Words and Bigrams, were not achieved. However, the naive method produced results that differ markedly from 50% and that would be extremely unlikely to achieve through pure luck.
Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
2 Theory
  2.1 Sentiment Analysis
    2.1.1 Sentiments/Opinions
    2.1.2 Sentiment analysis in social media
    2.1.3 Earlier work
  2.2 TensorFlow
  2.3 Neural Networks
    2.3.1 Features and preprocessing
    2.3.2 Char for Char
    2.3.3 Bag of words
    2.3.4 Bigrams
  2.4 Metrics
3 Method
  3.1 Acquiring a dataset
  3.2 Specifying the Model
  3.3 Dataset Transformation
  3.4 Train-test split
  3.5 Training and evaluating
4 Results
5 Discussion
6 Conclusion
Bibliography
A Tabular results
B Source Code
  B.1 Char for Char
    B.1.1 premade_estimator.py
    B.1.2 iris_data.py
    B.1.3 pre_proc.py
    B.1.4 train_test_splitter.py
  B.2 Bigram
    B.2.1 premade_estimator.py
    B.2.2 iris_data.py
    B.2.3 pre_proc.py
    B.2.4 train_test_splitter.py
Chapter 1
Introduction
Sentiment analysis, or sentiment classification, within computer science is the process of letting software guess the sentiment of the author of a provided text. The usual way of doing this is to let the program guess whether the sentiment is positive or negative. Sometimes the classification of neutral sentiment alongside the other two is also of interest.
Sentiment analysis has one potential use in giving insight into user preferences, which could be useful in a wide array of applications such as advertising or creating product rankings. As an example: suppose your company has made a change to a product beloved by its consumers and is worried about the reception this action will have. The consumers will likely vent their feelings on social media, giving you insight into their reaction.
Automating this task with computers would allow one to analyse many more reactions, and faster, than a human could. However, the analysis the computer provides needs to have high accuracy to be a useful tool. Consider that a coin flip would likely have an accuracy of 50% when deciding between two categories, such as positive and negative.
A challenge one faces when improving the accuracy of sentiment analysis is the complexity of human language. Sarcasm, a common way of expressing opinions in social and political discussion [13], has been proposed to be one contributing factor in making sentiment analysis hard to perform [9]. Even humans can often have trouble detecting sarcasm in text, because many important cues signaling sarcasm, like facial expression or tone of voice, are not present in written text. Being sarcastic is, simplified, to say something while meaning something other than what is explicitly stated. More formally, this report will rely on the following definition of sarcasm from the authors of the dataset that will be used in this text [12]:
A definition of sarcasm.
1. a sharp and often satirical or ironic utterance designed to be humorous, snarky, or mocking.

2. a mode of satirical wit depending for its effect on bitter, caustic, and often ironic language that is often directed against an individual or a situation.
Successfully detecting sarcasm in a text could be used to improve prediction accuracy. When sarcasm is detected in a text, the polarity of the sentiment prediction can be reversed: from positive to negative or vice versa.
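To illustrate the idea, a minimal Python sketch (illustrative only; the helper name is hypothetical and not part of this study's code):

    def adjust_polarity(sentiment, is_sarcastic):
        # Flip the predicted sentiment when the text is judged sarcastic.
        if not is_sarcastic:
            return sentiment
        return 'negative' if sentiment == 'positive' else 'positive'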
1.1 Research Question
The aim of this study is to evaluate whether we can develop a sarcasm detector for the short-form text commonly used in social media and internet forums. More concretely:
• Can machine learning be used on a dataset annotated as either sarcastic or not sarcastic and achieve acceptable accuracy in judging if a text is sarcastic or not?
• If yes, what kind of accuracy is it capable of?
1.2 Scope
This study will only consider a deep neural network model created with Google's machine learning framework TensorFlow. The dataset that will be used is the sarcasm_v2 dataset [12]. It contains 4692 quote-response pairs labeled as sarcastic or non-sarcastic based on whether the response is sarcastic. Half of the dataset is labeled sarcastic.
Chapter 2
Theory
2.1 Sentiment Analysis
“Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes.” [13]
2.1.1 Sentiments/Opinions
According to A Practical Guide to Sentiment Analysis [3], “Sentiment analysis mainly studies opinions that express or imply positive or negative sentiment”. Further, opinions are defined such that “An opinion is a quadruple, (g, s, h, t), where g is the sentiment target, s is the sentiment of the opinion about the target g, h is the opinion holder (the person or organization who holds the opinion), and t is the time when the opinion is expressed”.
Sentiment Analysis in Social Networks [13] similarly defines the term “opinion”: “an opinion is a quintuple, (e_i, a_ij, s_ijkl, h_k, t_l), where e_i is the name of an entity, a_ij is an aspect of e_i, s_ijkl is the sentiment on aspect a_ij of entity e_i, h_k denotes the opinion holder, and t_l is the time when the opinion is expressed by h_k.”
This second book further notes, just like Pozzi et al., that some would call the subject “opinion mining” and some “sentiment analysis” [13]. A sentiment would be “I like the color green” while an opinion would be “I think that green is a good color”; although there is a difference, it is rather subtle. It will therefore be discussed here as if it is the same subject. Also, while these opinions are a whole lot more complex, this study will only be focusing on the sentiment itself.
2.1.2 Sentiment analysis in social media
A general description of the problems faced within this area of research: “In fact, social network sentiment analysis, in addition to inheriting a multitude of issues from traditional sentiment analysis and natural language processing, introduces further complexities (short messages, noisy content, metadata such as gender, location, and age) and new sources of information not leveraged in traditional approaches.” [13]
The metadata will not be relevant to this study; however, the way the messages are formatted and the language used when writing them will be. Due to the informal writing featured in social media, the lexical approach, where you consult a lexicon for the sentimental value a word holds, which has done very well for example when analysing movie reviews [15], is no longer as useful an approach without extensive preprocessing [13].
Yang et al. worded it: “Most existing techniques rely on natural language processing tools to parse and analyze sentences in a review, yet they offer poor accuracy, because the writing in online reviews tends to be less formal than writing in news or journal articles. Many opinion sentences contain grammatical errors and unknown terms that do not exist in dictionaries.” [16] A machine learning approach does not suffer the same problems, since it does not need data on every single word it can potentially encounter. What the machine learning model does instead is explained in the TensorFlow and Neural Networks sections.
2.1.3 Earlier work
A fair amount of research on the subjects of sentiment analysis and sarcasm detection has been performed; however, we did not find much research dedicated to the combination of the two. Tables 2.1 and 2.2 show results from sentiment analysis run on plain Twitter input versus sarcastic Twitter input [5]. It can be seen that the success in determining sentiment for the sarcastic tweets is around 50% in the 2014 results. The 2015 results are a bit better for sarcasm detection, but with lower success for non-sarcastic data.
System       Twitter 2014   Sarcasm 2014
TeamX        70.96          56.50
coooolll     70.14          46.66
RTRGO        69.95          47.09
NRC-Canada   69.85          58.16
TUGAS        69.00          52.87
CISUC_KIS    67.95          55.49
SAIL         67.77          57.26

Table 2.1: Sentiment analysis task in F-measure terms for both regular and sarcastic tweets in the 2014 edition of SemEval.
System      Twitter 2015   Sarcasm 2015
Webis       64.84          53.59
unitn       64.59          55.01
lsislif     64.27          46.00
INESC-ID    64.17          64.91
Splusplus   63.73          60.99

Table 2.2: Best results in the sentiment analysis task in F-measure terms for both regular and sarcastic tweets in the 2015 edition of SemEval.
Another example of earlier results is a study on Greek tweets about the 2015 Greek election [2]. They asked the general public to perform the annotation and ended up with about 4600 annotated tweets. The results shown in table 2.3 were achieved.
These results are better; however, the data was annotated by “134 different user sessions”, meaning 134 or fewer different unknown individuals.
Category        Precision   Recall   F1-score   Test samples
Non-sarcastic   0.69        0.62     0.65       621
Sarcastic       0.72        0.78     0.75       772
Average/total   0.70        0.71     0.70       1393

Table 2.3: Results of sentiment analysis on tweets from the Greek election.
2.2 TensorFlow
TensorFlow is a machine learning framework created by Google. It will be utilized to create a deep neural network model which in turn is going to be used for predicting if a text is sarcastic or not. TensorFlow provides a concept called Estimators, which are defined as: “a high-level TensorFlow API that greatly simplifies machine learning programming. Estimators encapsulate the following actions:
• training
• evaluation
• prediction
• export for serving” [4]
2.3 Neural Networks
A neural network can be defined as “A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities” [10]. The first layer is the input layer, which contains the features the author of the neural network specifies. There is also the output layer, the final layer, which presents the answers the neural network has arrived at when it is done computing. In between these are the so-called hidden layers. The idea is to have several hidden layers, each containing many nodes, which process the data provided to them. Accompanying the data are labels, giving the neural network feedback on whether it was successful in labeling the data. Based on this feedback, values in the nodes of the neural network change, and in this way the neural network can learn to recognize patterns in the data.
2.3.1 Features and preprocessing
When used in TensorFlow, the data will be in the form of a matrix. Every row in this matrix represents one input; in the case of this study, inputs are strings which contain one message each from an online discussion board. Every column in the matrix is one feature. What data a feature will encompass is defined by the programmer, and will vary depending on which method for processing the input is being used.
In the case of using strings as input, some method for processing the data must be applied, since TensorFlow only operates on numerical data [6]. This transformation should, if possible, preserve the ordering of the input string due to the impact it may have on how the string is interpreted. Two strings with the same words can be interpreted as either sarcastic or not depending on the word order, e.g. “yeah, right” and “right, yeah”.
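A small demonstration of why a purely count-based representation loses this ordering (illustrative Python, not part of the study's code):

    from collections import Counter

    # Identical word counts, opposite word order: the two strings become
    # indistinguishable once order is discarded.
    print(Counter("yeah right".split()) == Counter("right yeah".split()))  # True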
There exist multiple ways of transforming strings to numerical data. Some methods used in earlier works are Bag of Words (BOW) and its generalisation n-grams [3][1]. A naïve approach, here called Char for Char, is also utilized.
2.3.2 Char for Char
This is a method of processing the data where each character in the input string is represented as a feature. When preprocessing, the first step is to find the longest input string, measured in number of characters. The number of features is then set to the length of this input string. Each string is split into single characters, with each character becoming a feature. Finally, the row is padded to the length of the longest input, with whitespace characters making up the remaining features.
Figure 2.2: Illustration of Char for Char.
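A minimal Python sketch of the method (simplified compared to the full implementation in appendix B.1.3):

    def char_for_char(texts):
        # One feature per character, padded with spaces to the longest text.
        max_length = max(len(text) for text in texts)
        return [list(text) + [' '] * (max_length - len(text)) for text in texts]

    rows = char_for_char(["yeah, right", "ok"])
    # Every row now has 11 features; the shorter text is space-padded.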
Figure 2.3: Illustration of Bag of Words and Bigrams.
2.3.3 Bag of words
Bag of words is a very different method of processing data. This method does not handle characters but entire words. Each unique word in all of the input data combined is considered a feature (column). What is then counted for each input (row) is the number of occurrences of each word in that input. If the word “car” occurs 4 times in an input string, then the “car” feature for that input will contain the number 4.
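A minimal Python sketch of the idea (simplified compared to the implementation in appendix B.2.3):

    from collections import Counter

    def bag_of_words(texts):
        # One column per unique word; each row holds that text's word counts.
        vocabulary = sorted({word for text in texts for word in text.split()})
        counts = [Counter(text.split()) for text in texts]
        return vocabulary, [[c[word] for word in vocabulary] for c in counts]

    vocabulary, rows = bag_of_words(["the car hit the car", "the car stopped"])
    # vocabulary: ['car', 'hit', 'stopped', 'the']
    # rows:       [[2, 1, 0, 2], [1, 0, 1, 1]]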
2.3.4 Bigrams
The bag of words method and the Bigrams method are essentially the same. They are both specialized cases of the N-grams method (where bag of words would be considered “1-grams” or “unigrams”). The N-grams method works the same way as described above for bag of words, but with the important difference that not just every word but rather each group of N adjacent words found in the input is considered a feature. If an input for example contained the string “Machine learning is fun”, the features “Machine learning”, “learning is” and “is fun” would be extracted when using a bigram model (which is to say: when N = 2).
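The extraction step can be sketched in a few lines of Python (illustrative only):

    def ngrams(text, n=2):
        # Each group of n adjacent words becomes one feature.
        words = text.split()
        return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(ngrams("Machine learning is fun"))
    # ['Machine learning', 'learning is', 'is fun']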
2.4 Metrics
For measuring performance, accuracy will be used. Accuracy is the number of true positives plus the number of true negatives, divided by the total number of examples. True positives are all the instances classified as positive that are actually positive; true negatives are all the instances classified as negative that are actually negative. Accuracy then is the percentage of instances correctly classified [14].

Accuracy = (TruePositives + TrueNegatives) / TotalNumberOfExamples
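As a worked example (with illustrative numbers): if 290 sarcastic and 245 non-sarcastic examples out of 938 test examples were classified correctly, the accuracy would be (290 + 245) / 938 ≈ 0.57, roughly the level reported in chapter 4.

    def accuracy(true_positives, true_negatives, total_examples):
        return (true_positives + true_negatives) / total_examples

    print(accuracy(290, 245, 938))  # 0.5703...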
Chapter 3
Method
The approach in this study is divided into four parts: firstly acquiring a dataset, secondly choosing a machine learning model, then transforming that dataset to a format suitable for TensorFlow, and lastly using the transformed data to train and evaluate the machine learning model. This chapter will describe these processes.
3.1 Acquiring a dataset
Earlier work has compiled different datasets with text annotated as sarcastic or not, e.g. the two that we found: the Sarcasm Corpus V2 [12] and the Self Annotated Reddit Corpus (SARC) [8].
We compared the two datasets mainly on their quality and ease of use. The 4692 examples in the Sarcasm Corpus V2 were annotated by crowdsourcing, meaning independent people had gone through and annotated every example. This dataset was easily available as a comma-separated values file. SARC was, as the name hints, annotated by the authors of the comments themselves. (There exists a culture on Reddit of marking one's comment with '/s' to indicate that it is sarcastic.) The annotated comments are probably sarcastic, but we cannot be sure whether the unmarked ones are sarcastic or not. This diminishes the quality of the dataset. A normally positive trait of the dataset is that it consists of circa 1.3 million annotated comments. This unfortunately makes SARC somewhat unwieldy considering the time and computing power at our disposal, so we decided to use the Sarcasm Corpus V2 for this report, since we deemed it to be of higher quality and quite easy to work with.
Corpus: GEN | Label: sarc | ID: GEN_sarc_0000
Quote: First off, That's grade A USDA approved Liberalism in a nutshell.
Response: Therefore you accept that the Republican party almost as a whole is "grade A USDA approved Liberalism." About time you did.

Corpus: GEN | Label: sarc | ID: GEN_sarc_0001
Quote: watch it. Now you're using my lines. Poet has always been an easy target, I will agree. ;)
Response: More chattering from the peanut gallery? Haven't gotten the memo, you're no longer a player? Honestly....clamoring for attention is so low budget. No shame.

Corpus: RQ | Label: notsarc | ID: RQ_notsarc_0397
Quote: This pretty much sums up people like Penfold. This difinitivly shows that he believes your right to own firearms should be taken away. Thank heaven our founding fathers sought to protect us from the likes of him and enumerated our right to keep and bear arms. Happy New Year
Response: Don't be so faithful in our laws. Remember prohibition? Of course, I myself highly doubt that Obama will manage to ban firearms, but mark my words, he will do anything in his power to restrict them.

Table 3.1: Some examples from the Sarcasm Corpus V2.
3.2 Specifying the Model
There exist many excellent machine learning frameworks. We chose to use Google's TensorFlow framework for Python for this study. TensorFlow offers good documentation and tools, which made it easy for us to get started and create our model. We based our model on the model from the documentation's Get Started tutorial [7], creating a deep neural network [11] with two hidden layers, each with 10 nodes. Refer to B.1.1, B.1.2, B.2.1 and B.2.2 in the appendix.
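Condensed from the appendix code (TensorFlow 1.x Estimator API), the model definition boils down to:

    import tensorflow as tf
    import iris_data  # the data-loading module listed in appendix B.1.2

    (train_x, train_y), (test_x, test_y) = iris_data.load_data()
    my_feature_columns = [
        tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_vocabulary_file(
                key=key, vocabulary_file=iris_data.VOCABULARY_PATH))
        for key in train_x.keys()]

    classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        hidden_units=[10, 10],   # two hidden layers of 10 nodes each
        n_classes=2,             # sarcastic or not sarcastic
        model_dir='models/cfc')  # where checkpoints are saved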
3.3 Dataset Transformation
Before being able to start training our model, we had to transform our dataset into a format suitable for TensorFlow. In our research, N-grams or one of its specialized forms, i.e. bag of words/unigrams and bigrams, were often used [3]. We decided to try these preprocessing methods as well as the naïve Char for Char method. We implemented the preprocessing methods in Python (see B.1.3 and B.2.3 for more details) and created our transformed datasets.
3.4 Train-test split
The last step before we could start training our model was splitting our data into one training set and one test set. We decided to dedicate 80% of our dataset (3754 examples) to training and the remaining 20% (938 examples) to testing. We made sure to keep the ratio of sarcastic to non-sarcastic comments the same in the test and training subsets as in the original dataset, i.e. 50/50, to eliminate unintentional biasing of the data. Otherwise the examples were randomly assigned to the training or test subset. For more information see B.1.4 and B.2.4.
3.5 Training and evaluating
With the dataset prepared for our chosen model, we started training and evaluating it. TensorFlow contains simple tools for training a model created with it for some number of steps (updates of the model's weights), each step operating on a batch of examples. It contains similar tools for evaluating the accuracy of the trained model. We continuously trained for 1000 steps at a time, followed by an evaluation. Each step used a batch of 100 examples. We output the result of each evaluation, i.e. the accuracy, to standard output, and when we ran the program we piped it to a file. We let the model train and evaluate overnight.
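Condensed from the appendix code, and continuing the snippet from section 3.2, the loop alternates training and evaluation indefinitely (1000 steps of 100 examples per cycle):

    while True:
        classifier.train(
            input_fn=lambda: iris_data.train_input_fn(train_x, train_y, 100),
            steps=1000)
        eval_result = classifier.evaluate(
            input_fn=lambda: iris_data.eval_input_fn(test_x, test_y, 100))
        print('{accuracy:f}'.format(**eval_result))  # piped to a file when run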
Chapter 4
Results
The result from training and evaluating the model on the dataset preprocessed with Char for Char was an accuracy of circa 57%. Figure 4.1 shows how the accuracy changed over the course of training. The exact data is available in table A.1 in the appendix.
The accuracy of the model trained on the dataset preprocessed with Char for Char turned out to be quite constant with respect to the amount of training. The accuracy hovered around an average of approximately 57%. The probability of getting this accuracy or better by flipping a coin for each of the 100 examples in a batch (50% probability with 100 trials) would be approximately 9.7%.
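The 9.7% figure can be checked with an exact binomial computation (a sketch; the report does not state how the figure was originally computed):

    from math import comb

    # P(at least 57 correct out of 100 fair coin flips)
    p = sum(comb(100, k) for k in range(57, 101)) / 2**100
    print(p)  # approximately 0.097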
Figure 4.1: Chart of accuracy at s steps.
When training the model on the dataset preprocessed with the Bag of Words method, not a single training-plus-evaluation cycle completed in one night. It was then decided that this approach took too long for reliable results to be achieved. This was surprising, given that the Bag of Words method has been used successfully before [3]. The bigram preprocessing method was also discarded at this point, due to the output of the bigram preprocessing growing even faster than that of the Bag of Words method. The dataset processed with the Bag of Words method became 527 MB (an increase of 200 times compared to the original dataset) and seems correct upon inspection. The Char for Char dataset became 23 MB after preprocessing (an increase of almost 9 times) in comparison.
Chapter 5
Discussion
This chapter will begin by discussing whether our results are reasonable. It will then continue by discussing them in the broader context of sentiment analysis.
With simple methods and tools we have developed a TensorFlow model capable of correctly classifying 57% of the examples in the test subset when preprocessing it with the Char for Char method. While "only" beating the expected accuracy of coin flipping by 7 percentage points, the probability of getting such a result by chance is approximately 9.7%, as shown in chapter 4: Results. This insight, together with the fact that the ratio of sarcastic to non-sarcastic examples is 50/50 in all of the datasets, suggests that the model is doing something more complex than flipping a coin or always guessing the same category (always sarcastic or always non-sarcastic). It is hard to compare our results with what others have achieved, because earlier work used the F1-score while we used accuracy as our metric. This was a mistake in the design of the experiment. If redone, it should measure F1-score, allowing the results to be compared to earlier work.
That we were not able to achieve any results using the Bag of Words preprocessing method is strange considering its frequent appearance in earlier work. This indicates that our implementation is probably faulty in some way. The fact that the data became an order of magnitude larger when processed with the Bag of Words method compared to the Char for Char method, while seeming correct, points to it not dealing with the complexities of the data in a sufficient manner. Further techniques to simplify the data, such as stripping the text of punctuation, correcting misspellings, collapsing inflections to the base word, etc., could have been employed and might have improved the results. These techniques might also be beneficial for the Char for Char method.
One could improve the experiment and the model's results in it, but how would that translate to the wider context of sentiment analysis? This report has not determined how the created model would generalise to different datasets, such as general social media. Would the model be able to keep its accuracy in a real-world test compared to the sheltered environment of this experiment? Would it even matter in a real-world application? Would the preprocessing methods themselves generalise? Further, it would be interesting to see how a sarcasm detector like the one presented in this report would affect sentiment analysis methods. Would they benefit from being able to treat sarcastic comments separately? If yes, what level of accuracy is necessary for it to improve the result of the sentiment analysis? These are some points that would be interesting to research in the future.
For the preprocessing methods there are some characteristics that might affect how they generalise. The Bag of Words method works on the specific vocabulary found in the dataset it was trained on. If a new word were found in real-world data, the model would not be able to handle it. It is conceivable that it would be acceptable to discard such words or change them to a synonym found in the model's vocabulary. The Char for Char method instead works on the alphabet, allowing it to gracefully accept new words. Of course, if a new character were to appear, this method would face the same problem as the Bag of Words method does when presented with a new word. It does, however, seem more rare that new characters are introduced into a language than new words. Given this, the Char for Char method might be more resilient in a real-world scenario.
Chapter 6
Conclusion
Given the results of this report, we conclude that it is possible to use machine learning, TensorFlow specifically, to detect sarcasm in the short free-form text found in social media. Our model, created with naïve methods and tools, achieved an accuracy of 57%. We speculate that this result could be improved upon with more extensive preprocessing or more traditional preprocessing methods.
Bibliography
[1] Basant Agarwal and Namita Mittal. “Introduction”. In: Prominent Feature Extraction for Sentiment Analysis. Springer, 2016, pp. 1–4.

[2] Despoina Antonakaki et al. “Social media analysis during political turbulence”. In: PloS one 12.10 (2017), e0186836.

[3] Erik Cambria et al. A Practical Guide to Sentiment Analysis. Vol. 5. Springer, 2017.

[4] Estimators | TensorFlow. URL: https://www.tensorflow.org/programmers_guide/estimators

[5] D.I. Hernández Farías and Paolo Rosso. “Irony, sarcasm, and sentiment analysis”. In: Sentiment Analysis in Social Networks. Elsevier, 2017, pp. 113–128.

[6] Feature Columns | TensorFlow. URL: https://www.tensorflow.org/get_started/feature_columns

[7] Get Started with Graph Execution | TensorFlow. URL: https://www.tensorflow.org/get_started/get_started_for_beginners

[8] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. “A large self-annotated corpus for sarcasm”. In: arXiv preprint arXiv:1704.05579 (2017).

[9] Edwin Lunando and Ayu Purwarianti. “Indonesian social media sentiment analysis with sarcasm detection”. In: Advanced Computer Science and Information Systems (ICACSIS), 2013 International Conference on. IEEE, 2013, pp. 195–198.

[10] Machine Learning Glossary | Google Developers. URL: https://developers.google.com/machine-learning/glossary/#neural_network

[11] Machine Learning Glossary | Google Developers. URL: https://developers.google.com/machine-learning/glossary/#deep_model

[12] Shereen Oraby et al. “Creating and characterizing a diverse corpus of sarcasm in dialogue”. In: arXiv preprint arXiv:1709.05404 (2017).

[13] Federico Alberto Pozzi et al. Sentiment Analysis in Social Networks. Morgan Kaufmann, 2016.

[14] Claude Sammut and Geoffrey I. Webb. Encyclopedia of Machine Learning. Springer Science & Business Media, 2011.

[15] Tun Thura Thet, Jin-Cheon Na, and Christopher S.G. Khoo. “Aspect-based sentiment analysis of movie reviews on discussion boards”. In: Journal of Information Science 36.6 (2010), pp. 823–848.

[16] Christopher C. Yang et al. “Understanding online consumer review opinions with sentiment analysis using machine learning”. In: Pacific Asia Journal of the Association for Information Systems 2.3 (2010).
Appendix A

Tabular results
After # of steps   Char for Char
1000               0.573561
2000               0.568230
3000               0.569296
4000               0.567164
5000               0.568230
6000               0.568230
7000               0.570362
8000               0.570362
9000               0.571429
10000              0.572495
11000              0.572495
12000              0.570362
13000              0.571429
14000              0.570362
15000              0.570362
16000              0.570362
17000              0.571429
18000              0.574627
19000              0.574627
20000              0.572495
21000              0.574627
22000              0.575693
23000              0.575693
24000              0.575693
25000              0.575693
26000              0.575693
27000              0.575693
28000              0.575693
29000              0.575693
30000              0.575693
31000              0.575693
32000              0.575693
33000              0.574627
34000              0.574627
35000              0.573561
36000              0.573561
37000              0.573561
38000              0.573561
39000              0.573561
40000              0.572495
41000              0.572495
42000              0.572495

Table A.1: Accuracy of the ML model after n steps.
Appendix B

Source Code

B.1 Char for Char

B.1.1 premade_estimator.py
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An Example of a DNN Classifier for the Iris dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse

import tensorflow as tf

import iris_data

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=100, type=int,
                    help='batch size')
parser.add_argument('--train_steps', default=1000, type=int,
                    help='number of training steps')
parser.add_argument('--model_dir', default='models/cfc', type=str,
                    help='directory to save model checkpoints')


def main(argv):
    args = parser.parse_args(argv[1:])

    # Fetch the data
    (train_x, train_y), (test_x, test_y) = iris_data.load_data()

    # Feature columns describe how to use the input.
    my_feature_columns = []
    for key in train_x.keys():
        my_feature_columns.append(
            tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_file(
                    key=key,
                    vocabulary_file=iris_data.VOCABULARY_PATH)))

    # Build 2 hidden layer DNN with 10, 10 units respectively.
    classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 2 classes.
        n_classes=2,
        model_dir=args.model_dir)

    while True:
        # Train the Model.
        classifier.train(
            input_fn=lambda: iris_data.train_input_fn(train_x, train_y,
                                                      args.batch_size),
            steps=args.train_steps)

        # Evaluate the model.
        eval_result = classifier.evaluate(
            input_fn=lambda: iris_data.eval_input_fn(test_x, test_y,
                                                     args.batch_size))

        print('{accuracy:f}\n'.format(**eval_result))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.FATAL)
    tf.app.run(main)
B.1.2 iris_data.py
import pandas as pd
import tensorflow as tf
import csv

TRAIN_PATH = "../res/train.csv"
TEST_PATH = "../res/test.csv"
HEADERS_PATH = "../res/headers.csv"
VOCABULARY_PATH = "../res/vocabulary.txt"

SPECIES = ['Setosa', 'Versicolor', 'Virginica']


def load_column_names():
    headers = []
    with open(HEADERS_PATH, 'r') as headers_file:
        reader = csv.reader(headers_file)
        for row in reader:
            headers.append(row)
    return headers[0]


def load_data(y_name='label'):
    """Returns the iris dataset as (train_x, train_y), (test_x, test_y)."""
    col_names = load_column_names()

    train = pd.read_csv(TRAIN_PATH, names=col_names, header=0,
                        quoting=csv.QUOTE_ALL)
    train.pop("1311")  # TODO do not pop col, but got nan in it
    train_x, train_y = train, train.pop(y_name)

    test = pd.read_csv(TEST_PATH, names=col_names, header=0,
                       quoting=csv.QUOTE_ALL)
    test.pop("1311")  # TODO do not pop col, but got nan in it
    test_x, test_y = test, test.pop(y_name)

    return (train_x, train_y), (test_x, test_y)


def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset


def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset


# The remainder of this file contains a simple example of a csv parser,
# implemented using the `Dataset` class.

# `tf.parse_csv` sets the types of the outputs to match the examples given in
# the `record_defaults` argument.
CSV_TYPES = [[0.0], [0.0], [0.0], [0.0], [0]]


def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, record_defaults=CSV_TYPES)

    # Pack the result into a dictionary
    features = dict(zip(load_column_names(), fields))

    # Separate the label from the features
    label = features.pop('Species')

    return features, label


def csv_input_fn(csv_path, batch_size):
    # Create a dataset containing the text lines.
    dataset = tf.data.TextLineDataset(csv_path).skip(1)

    # Parse each line.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset
B.1.3 pre_proc.py
import csv

TXT = '../res/vocabulary.txt'
PROCESSED_DATA_CSV = '../res/processed_data.csv'
RES_HEADERS_CSV = '../res/headers.csv'
SOURCE = "../res/sarcasm_v2.csv"

# First pass: find the length of the longest response text.
max_length = 0
skipped_headings = False
with open(SOURCE, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    for row in sarcasm_reader:
        if not skipped_headings:
            skipped_headings = True
            continue
        text = row[4]
        if len(text) > max_length:
            max_length = len(text)

# Second pass: one row per text, label first, then one character per feature.
skipped_headings = False
processed_rows = []
with open(SOURCE, 'r') as corpus:
    sarcasm_reader = csv.reader(corpus, delimiter=',')
    for row in sarcasm_reader:
        if not skipped_headings:
            skipped_headings = True
            continue
        text = row[4]
        label = row[1]
        text_len = len(text)
        processed_row = []
        if label == "notsarc":
            processed_row.append(0)
        else:
            processed_row.append(1)
        for char in text:
            processed_row.append(char)
        # Pad with spaces. Note: the row also contains the label, so rows end
        # up one column short of the headers; this is why iris_data.py pops
        # the "1311" column.
        while len(processed_row) < max_length:
            processed_row.append(' ')
        processed_rows.append(processed_row)

headers = []
headers.append("label")
for i in range(max_length):
    headers.append(i)

with open(RES_HEADERS_CSV, 'w') as headers_file:
    writer = csv.writer(headers_file)
    writer.writerow(headers)

with open(PROCESSED_DATA_CSV, 'w') as save_file:
    writer = csv.writer(save_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)
    for processed_row in processed_rows:
        writer.writerow(processed_row)

# Collect every distinct character seen, for the vocabulary file.
voc = {}
for row in processed_rows:
    for i in range(1, len(row)):
        voc[row[i]] = 0

with open(TXT, 'w') as voc_file:
    keys = list(voc.keys())
    keys.sort()
    for key in keys:
        voc_file.write(key)
        voc_file.write("\n")
B.1.4 train_test_splitter.py
import csv
import random

RES_TEST_CSV = '../res/test.csv'
RES_TRAIN_CSV = '../res/train.csv'
PROCESSED_DATA_CSV = '../res/processed_data.csv'

with open(PROCESSED_DATA_CSV, 'r') as proc_data_file:
    reader = csv.reader(proc_data_file, delimiter=',',
                        quoting=csv.QUOTE_ALL)
    data = []
    for row in reader:
        data.append(row)

headers = data.pop(0)

nonsarc_data = list(filter(lambda x: x[0] == '0', data))
sarc_data = list(filter(lambda x: x[0] == '1', data))

amount_of_nonsarc_train = round(len(nonsarc_data) * 0.8)
amount_of_nonsarc_test = len(nonsarc_data) - amount_of_nonsarc_train
amount_of_sarc_train = round(len(sarc_data) * 0.8)
amount_of_sarc_test = len(sarc_data) - amount_of_sarc_train

with open(RES_TRAIN_CSV, 'w') as train_file:
    writer = csv.writer(train_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_train:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_train:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1

with open(RES_TEST_CSV, 'w') as test_file:
    writer = csv.writer(test_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(headers)

    written_nonsarc = 0
    while written_nonsarc < amount_of_nonsarc_test:
        index = random.randint(0, len(nonsarc_data) - 1)
        row = nonsarc_data.pop(index)
        writer.writerow(row)
        written_nonsarc += 1

    written_sarc = 0
    while written_sarc < amount_of_sarc_test:
        index = random.randint(0, len(sarc_data) - 1)
        row = sarc_data.pop(index)
        writer.writerow(row)
        written_sarc += 1
B.2 Bigram

B.2.1 premade_estimator.py
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An Example of a DNN Classifier for the Iris dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse

import tensorflow as tf

import iris_data

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=100, type=int,
                    help='batch size')
parser.add_argument('--train_steps', default=1000, type=int,
                    help='number of training steps')
parser.add_argument('--model_dir', default='models/bow', type=str,
                    help='directory to save model checkpoints')


def main(argv):
    args = parser.parse_args(argv[1:])

    # Fetch the data
    (train_x, train_y), (test_x, test_y) = iris_data.load_data()

    # Feature columns describe how to use the input.
    my_feature_columns = []
    for key in train_x.keys():
        my_feature_columns.append(
            tf.feature_column.numeric_column(key=key))

    # Build 2 hidden layer DNN with 10, 10 units respectively.
    classifier = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 2 classes.
        n_classes=2,
        model_dir=args.model_dir)

    while True:
        # Train the Model.
        classifier.train(
            input_fn=lambda: iris_data.train_input_fn(train_x, train_y,
                                                      args.batch_size),
            steps=args.train_steps)

        # Evaluate the model.
        eval_result = classifier.evaluate(
            input_fn=lambda: iris_data.eval_input_fn(test_x, test_y,
                                                     args.batch_size))

        print('{accuracy:f}\n'.format(**eval_result))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    tf.app.run(main)
B.2.2 iris_data.py
import pandas as pd
import tensorflow as tf
import csv

TRAIN_PATH = "../res/train.csv"
TEST_PATH = "../res/test.csv"
HEADERS_PATH = "../res/headers.csv"
VOCABULARY_PATH = "../res/vocabulary.txt"


def load_column_names():
    headers = []
    with open(HEADERS_PATH, 'r') as headers_file:
        reader = csv.reader(headers_file)
        for row in reader:
            headers.append(row)
    return headers[0]


def load_data(y_name='label'):
    """Returns the iris dataset as (train_x, train_y), (test_x, test_y)."""
    col_names = load_column_names()

    train = pd.read_csv(TRAIN_PATH, names=col_names, header=0,
                        quoting=csv.QUOTE_ALL)
    train_x, train_y = train, train.pop(y_name)

    test = pd.read_csv(TEST_PATH, names=col_names, header=0,
                       quoting=csv.QUOTE_ALL)
    test_x, test_y = test, test.pop(y_name)

    return (train_x, train_y), (test_x, test_y)


def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset


def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset


# The remainder of this file contains a simple example of a csv parser,
# implemented using the `Dataset` class.

# `tf.parse_csv` sets the types of the outputs to match the examples given in
# the `record_defaults` argument.
CSV_TYPES = [[0.0], [0.0], [0.0], [0.0], [0]]


def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, record_defaults=CSV_TYPES)

    # Pack the result into a dictionary
    features = dict(zip(load_column_names(), fields))

    # Separate the label from the features
    label = features.pop('Species')

    return features, label


def csv_input_fn(csv_path, batch_size):
    # Create a dataset containing the text lines.
    dataset = tf.data.TextLineDataset(csv_path).skip(1)

    # Parse each line.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(5000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset
B.2.3 pre_proc.py
46 APPENDIX B. SOURCE CODE
imp
ort
csv
imp
ort
sys
PRO
CES
SED
_DA
TA_C
SV=
’../
res/
pro
cess
ed_d
ata
.csv
’
VO
CABU
LARY
_TXT
=’.
./re
s/v
ocab
ula
ry.t
xt’
RES_
HEA
DER
S_C
SV=
’../
res/
hea
der
s.c
sv’
SARC
ASM
_V__
CSV
=’.
./re
s/sa
rcas
m_v
2.c
sv’
RES_
SARC
ASM
_V__
CSV
=’.
./re
s/sa
rcas
m_v
2.c
sv’
skip
ped
_hea
din
gs
=F
alse
wit
hop
en(R
ES_S
ARC
ASM
_V__
CSV
,’r
’)as
corp
us
:sa
rcas
m_r
ead
er=
csv
.rea
der
(cor
pu
s,
del
imit
er=
’,’)
text
_acc
um
ula
tor
=’’
for
row
insa
rcas
m_r
ead
er:
ifn
otsk
ipp
ed_h
ead
ing
s:
skip
ped
_hea
din
gs
=T
rue
con
tin
ue
APPENDIX B. SOURCE CODE 47
text
_acc
um
ula
tor
+=’
’+
row
[4]
voca
b=
{wor
dfo
rw
ord
inte
xt_a
ccu
mu
lato
r.s
pli
t()
}
pri
nt(
"Don
eb
uil
din
gvo
cab
")p
rin
t(le
n(v
ocab
),"
un
iqu
ew
ords
")
skip
ped
_hea
din
gs
=F
alse
pro
cess
ed_r
ows
=[]
wit
hop
en(S
ARC
ASM
_V__
CSV
,’r
’)as
corp
us
:sa
rcas
m_r
ead
er=
csv
.rea
der
(cor
pu
s,
del
imit
er=
’,’)
i=
0fo
rro
win
sarc
asm
_rea
der
:i
+=1
sys
.std
ou
t.w
rite
(’\
rrow
{0:0
5d
}’.
form
at(i
))sy
s.s
tdo
ut.
flu
sh()
ifn
otsk
ipp
ed_h
ead
ing
s:
skip
ped
_hea
din
gs
=T
rue
con
tin
ue
tex
t=
row
[4]
lab
el=
row
[1]
pro
cess
ed_r
ow=
[]
48 APPENDIX B. SOURCE CODE
ifla
bel
=="n
ots
arc
":p
roce
ssed
_row
.app
end
(0)
else
: pro
cess
ed_r
ow.a
ppen
d(1
)
loca
l_v
oca
b=
{}
wor
ds=
tex
t.s
pli
t()
for
wor
din
wor
ds:
ifw
ord
not
inlo
cal_
vo
cab
:lo
cal_
vo
cab
[wor
d]
=1
else
: loca
l_v
oca
b[w
ord
]=
loca
l_v
oca
b[w
ord
]+
1
for
wor
din
voca
b:
ifw
ord
not
inlo
cal_
vo
cab
:lo
cal_
vo
cab
[wor
d]
=0
for
wor
din
sort
ed(l
oca
l_v
oca
b.k
eys
()):
pro
cess
ed_r
ow.a
ppen
d(l
oca
l_v
oca
b[w
ord
])
pro
cess
ed_r
ows
.app
end
(pro
cess
ed_r
ow)
APPENDIX B. SOURCE CODE 49
pri
nt(
"\nD
one
cou
nti
ng
occ
ure
nce
s")
hea
der
s=
["la
bel
"]
for
wor
din
sort
ed(v
ocab
):h
ead
ers
.app
end
(wor
ds)
wit
hop
en(
’../
res/
real
_hea
der
s.c
sv’,
’w’)
ash
ead
ers_
file
:w
rite
r=
csv
.wri
ter
(hea
der
s_fi
le)
wri
ter
.wri
tero
w(h
ead
ers
)
hea
der
s=
["la
bel
"]
i=
0fo
rw
ord
inso
rted
(voc
ab):
hea
der
s.a
ppen
d(i
)i
+=1
wit
hop
en(R
ES_H
EAD
ERS_
CSV
,’w
’)as
hea
der
s_fi
le:
wri
ter
=cs
v.w
rite
r(h
ead
ers_
file
)w
rite
r.w
rite
row
(hea
der
s)
pri
nt(
"Hea
der
ssa
ved
")
50 APPENDIX B. SOURCE CODE
wit
hop
en(P
ROC
ESSE
D_D
ATA
_CSV
,’w
’)as
sav
e_fi
le:
wri
ter
=cs
v.w
rite
r(s
ave_
file
,d
elim
iter
=’,
’,q
uot
ing
=cs
v.Q
UO
TE_A
LL)
wri
ter
.wri
tero
w(h
ead
ers
)
for
pro
cess
ed_r
owin
pro
cess
ed_r
ows
:w
rite
r.w
rite
row
(pro
cess
ed_r
ow)
pri
nt(
"Dat
asa
ved
")
wit
hop
en(V
OCA
BULA
RY_T
XT,
’w’)
asv
oc_
file
:ke
ys=
list
(voc
ab)
keys
.so
rt()
for
key
inke
ys:
vo
c_fi
le.w
rite
(str
(key
))v
oc_
file
.wri
te("
\n
")
B.2.4 train_test_splitter.py
This file is identical to the listing in appendix B.1.4.