SENTIMENT ANALYSIS OF TWITTER DATA
By
Bo Yuan
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
MASTER OF SCIENCE
Major Subject: COMPUTER SCIENCE
Examining Committee:
Boleslaw K. Szymanski, Thesis Adviser
Sibel Adali, Member
Malik Magdon-Ismail, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2016 (For Graduation May 2016)
© Copyright 2016
by
Bo Yuan
All Rights Reserved
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Sentiment Component . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Levels of Study . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sentiment Classification . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . 7
2.2.1.1 Sentiment Lexicon . . . . . . . . . . . . . . . . . . . 7
2.2.1.2 Lexicon-Based Classification Algorithms . . . . . . . 9
2.2.2 Machine Learning-Based Methods . . . . . . . . . . . . . . . . 9
2.2.2.1 Supervised Learning Methods . . . . . . . . . . . . . 9
2.2.2.2 Unsupervised Learning Methods . . . . . . . . . . . 10
2.2.3 Rule-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 10
3. Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Two Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Basic Lexicon-Based Methods . . . . . . . . . . . . . . . . . . 14
3.1.3 Linguistic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3.1 Negation . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3.2 Valence Shifter . . . . . . . . . . . . . . . . . . . . . 18
3.1.3.3 Contrast . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3.4 Linguistic Inference Rule . . . . . . . . . . . . . . . . 21
3.2 Machine Learning-Based Methods . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.2 Linguistic Features . . . . . . . . . . . . . . . . . . . 24
3.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4. Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Motivation for Data Gathering . . . . . . . . . . . . . . . . . 27
4.1.2 Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 Cleaning Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Rule-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Machine Learning-Based Methods . . . . . . . . . . . . . . . . . . . . 38
5.5 Evaluation Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
LITERATURE CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
APPENDICES
A. Linguistic Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
LIST OF TABLES
3.1 MPQA Example Entries . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Sample SentiWordNet Entries . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Example VADER Sentiment Lexicon . . . . . . . . . . . . . . . . . . . 16
3.4 N-Gram Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Positive and Negative Emoticons . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Topic Key Word(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1 Average Performance of Baseline Algorithms . . . . . . . . . . . . . . . 33
5.2 Best Performance of Lexicon-Based Methods Across Domains . . . . . . 34
5.3 Average Performance of Lexicon-Based Methods Across Domains . . . . 36
A.1 Valence Shifter Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 52
LIST OF FIGURES
3.1 Sample Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Results of Baseline Algorithms . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Results of Lexicon-Based Algorithms . . . . . . . . . . . . . . . . . . . 35
5.3 Results of Rule-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 37
5.4 Comparison of Best Performance with LIR Algorithm . . . . . . . . . . 37
5.5 Comparison of Average Performance with LIR Algorithm . . . . . . . . 38
5.6 Naive Bayes with N-Gram Bag-of-Words Features . . . . . . . . . . . . 39
5.7 Maximum Entropy with N-Gram Bag-of-Words Features . . . . . . . . 40
5.8 Support Vector Machines with N-Gram Bag-of-Words Features . . . . . 41
5.9 Average Performance of N-Gram Bag-of-Words Features . . . . . . . . . 42
5.10 Average Performance of Machine Learning Classifiers with Linguistic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.11 Comparison of Linguistic and Bag-of-Words Features . . . . . . . . . . 43
ACKNOWLEDGMENT
I would like to express my gratitude to my adviser, Professor Boleslaw Szymanski for
his generous support and kind help during my graduate study at RPI. I would also
like to thank Professor Adali and Professor Magdon-Ismail for graciously serving on
my thesis committee.
I would offer a special “Thank you” for RPI, my dearest alma mater. Life at
Rensselaer has been so wonderful that I would never forget. Thank you, RPI for
helping me find my “inner engineer”.
ABSTRACT
Sentiment Analysis and Opinion Mining has become a research hot-spot with the
rapid development of social network websites. Twitter is a typical social network
application with millions of users expressing their sentiment every day. In this work,
we explored comprehensively the methodologies applied in sentiment classification
over Twitter data: lexicon-based, rule-based and machine learning-based methods.
Our data-set was crawled and manually cleaned following the principle of Naturally
Annotated Big Data. The data-set contains 20,000 tweets ranging over ten popular
topics.
For lexicon-based methods, we experimented with the Simple Word Count
approach and the Feature Scoring approach using the most popular sentiment lexicons
and semantic resources, namely the MPQA subjectivity lexicon, SentiWordNet, the
VADER sentiment lexicon, Bing Liu's lexicon and the General Inquirer. We built
customized sentiment lexicons, designed feature scores and compared ten classifiers
on real-world Twitter data. Further, we designed Linguistic Inference Rules (LIR)
to improve lexicon-based classifiers. LIR aims to handle negation, valence shifts and
contrast conjunctions in natural language. For machine learning-based methods, we used
state-of-the-art supervised learning models: Naive Bayes, Maximum Entropy and
Support Vector Machines. Two sets of features are compared: Bag-of-Words with
N-Grams, and Part-of-Speech linguistic annotations.
1. Introduction
Sentiment and opinion are essential features of human existence. “What do we
think” and “how do we feel” play a vital role in our daily life. The decisions we
make are closely related to the emotion and attitude of both ourselves and others.
With the rapid development of Web 2.0, an increasing number of people are
expressing their opinions on-line. E-commerce websites are typical examples. Amazon
encourages customers to create reviews and provide feedback about the products
and services they purchase. By rating the products on a 5-star scale and writing
several paragraphs of review, the Amazon shoppers are able to share information
on “what people like or do not like”.
Social network website is another example where user-generated opinionated
data abounds. Social network websites usually contain a great scope of topics,
especially those related to big news events. Twitter, for example, is one of the most
popular social network websites to which people turn when big events occur. In 2010,
after a catastrophic 7.0 magnitude earthquake hit Haiti, Twitter served as a major
hub of information. Twitter was proven to be an important tool for fund-raising
and relief efforts [1]. Twitter has even changed the outcome of many historical
events, especially in political elections where millions of voters tweet frequently to
openly express their political approval or contempt. In the 2008 presidential election,
Twitter was integrated into President Obama’s campaign, which later proved to be
a huge success, inspiring numerous academic studies [2].
Sentiment analysis research goes hand in hand with the Internet boom. On the
one hand, applications of sentiment analysis provide significant commercial value.
On the other, sentiment analysis systems provide a basis for academic research in
computer science, linguistics, social science, management science, etc.
In this research, we will focus on sentiment classification of Twitter data. The
remainder of this thesis is structured as follows. Chapter 2 surveys the field of
study on definition, sub-tasks and methodologies. Chapter 3 illustrates our proposed
methods. In Chapter 4, experiment settings are described. Results of experiments
are discussed in Chapter 5. Finally, Chapter 6 summarizes our contributions and
points out future research directions.
2. Related Work
2.1 Definition
According to the Merriam-Webster dictionary [3], the word sentiment has three
layers of meanings:
• Predilection or opinion.
• Emotion or refined feeling.
• Idea colored by emotion.
By definition, all the automatic analysis of properties of such kind falls into the
range of sentiment analysis. According to Liu [4], sentiment analysis is the field of
study that analyzes peoples opinions, sentiments, evaluations, appraisal, attitudes,
and emotions towards entities such as products, services, organizations, individuals,
issues, events, topics and their attributes. The term can be used interchangeably
with Opinion Mining.
Farzindar [5] separates sentiment analysis and emotion analysis to emphasize
the subtle difference. Emotion analysis is more meticulously classified into finer
granularity. In [6], emotion is categorized into six classes: anger, disgust, fear, joy,
sadness and surprise, which are the most widely used in the literature. There is currently
no consensus on how many classes of emotions should be used. Emotion analysis is
also referred to as mood detection.
The distinction between sentiment analysis and emotion analysis is beyond the scope
of this work. In our research, all semantic orientations, per the three layers of
meaning above, expressed towards certain entities are counted as sentiment.
2.1.1 Sentiment Component
Sentiment can be divided into different components: holder, target, aspect
and polarity. Each component corresponds to specific tasks in a system.
Holder denotes the entity that holds the sentiment.
Target identifies the entity selected as the aim of the sentiment.
Polarity is the property of the sentiment. Polarity can be two-fold (positive and
negative) or three-fold (positive, negative and neutral).
Aspect defines the particular part or feature of the target that the sentiment is
expressed towards.
Let us take the following sentence as an example1:
Steve Jobs said that Microsoft simply has no taste.
The sentiment in this sentence can be analyzed as the holder (“Steve Jobs”)
expressed opinions towards the target ("Microsoft") and the polarity of the sentiment
is negative (“has no taste”).
Aspect is also an important sentiment component. Let us take the following
company review text as an example2:
Demandware as a company has a positive, people-centric, forward
thinking culture. The benefits and work life balance are great. But
cross-functional communication can be challenging.
The user has given an overall evaluation of Demandware with respect to four
aspects: culture, benefits, work-life balance and communication. While the first
three aspects receive positive evaluation, the last one receives negative evaluation
as described below:
Culture positive ("positive, people-centric, forward thinking").
Benefits positive ("great").
Work-life balance positive ("great").
Communication negative ("challenging").
1 http://www.computerworld.com/article/2471632/ (Date Last Accessed, March 1, 2016)
2 https://www.glassdoor.com/ (Date Last Accessed, March 1, 2016)
In summary, holder, target, polarity and aspect are four major components
of sentiment. They work together to convey sentiment expressed in natural lan-
guage. All of the components have attracted extensive studies in sentiment analysis
research.
2.1.2 Levels of Study
Sentiment analysis can be categorized according to the granularity of text.
Previous work mainly focuses on three levels:
• Document/text level
The analysis at this level is to determine whether the sentiment expressed in a
whole document is positive or negative. For example [7], given product reviews,
the system would be able to evaluate the overall sentiment polarity.
Document level analysis assumes a piece of text expresses sentiment towards
a single target. While this is usually true for product reviews, movie reviews,
restaurant reviews etc., it probably does not apply to situations where a document
criticizes multiple targets.
• Sentence level
The analysis at this level is to determine whether the opinions expressed in
a sentence are positive, negative or neutral. Sentence level analysis can be
conducted in two ways. One way is to simply regard the analysis as a 3-way
classification task, where the labels are positive, negative and neutral. The
second way is to first detect subjectivity in each sentence to split opinionated
texts from un-opinionated texts, then classify the subjective texts with
one of two labels (positive or negative).
The challenge of sentence level analysis is that each individual sentence is
semantically and syntactically connected with other parts of the text. There-
fore, this task requires both local and global contextual information. Yang [8]
analyzes product reviews at the sentence level and addresses this challenge successfully.
• Aspect/feature/entity level
Unlike document or sentence level analysis, aspect level analysis explores what
the holder likes or hates about the target. The tasks of such fine-grained
analysis are three-fold [9]: (1) extracting features of the target, (2) determining
feature-wise polarity, (3) summarizing the overall evaluation. Aspect level
sentiment analysis is one of the most challenging tasks compared to other levels
of analysis.
Besides, research can also be conducted at the phrase level [10], clause level
[11] or word level [12]. Some work also dives into comparative opinions [13], where
more than one target is compared, unlike regular opinions where only a single
target in each text is evaluated.
Twitter sentiment analysis falls into document level. However, since Twitter
allows a maximum of 140 characters3, each tweet status tends to be very short.
Usually a tweet contains only one simple sentence or just several words. Therefore,
Twitter sentiment analysis also calls for a wide variety of strategies utilized on other
levels of analysis.
2.1.3 Tasks
Major sentiment analysis tasks are defined by the sentiment components they
concern. With the help of modern technology, research has been widely conducted,
ranging over holder/target detection, sentiment classification, aspect extraction,
opinion spam detection etc. In our work, we will focus on document-level sentiment
classification. Specifically, given a tweet post, we will look into different methods of
assigning a polarity label.
2.2 Sentiment Classification
In this section, we will survey popular resources and methodologies used in
sentiment classification. By default, the task refers to document-level sentiment
3 https://dev.twitter.com/overview/api/counting-characters (Date Last Accessed, March 29, 2016)
classification where a whole document is regarded as an information unit. An as-
sumption made by researchers in this field is that the whole document under study
contains consistent sentiment polarity towards a single entity by a single holder.
Many types of reviews are a great example where the assumption holds true. For
tweet data, it is also true because tweets are usually short. It is not natural for a user
to include complicated information in a single tweet. The methods can be generally
categorized into three classes: lexicon-based, machine learning-based and rule-based
methods.
2.2.1 Lexicon-Based Methods
2.2.1.1 Sentiment Lexicon
A sentiment lexicon is a list of words or phrases that convey positive
or negative polarity information. The lexicon is a very important resource in sentiment
analysis: it provides sentiment information about the smallest linguistic units. Even
machine learning-based methods can rely on a sentiment lexicon in feature engineering.
Proper use of a well-designed lexicon will improve the performance of a sentiment
analysis system. In this part, we will introduce the most popular lexicons used both in
industry and academia. An overview of methods to compile customized sentiment
lexicons is also provided.
The MPQA subjectivity lexicon [10] is part of the MPQA Opinion Corpus4. The lexicon
is made available under the terms of the GNU License. Each entry represents a word and
its length, strength, Part-of-Speech and polarity. It provides a very comprehensive
amount of information, which has implications for various fields of study.
SentiWordNet [14] adds real-valued sentiment scores to each synset of WordNet
to denote its sentiment polarity (positive, negative and objective). Besides, Part-of-
Speech and context information are also incorporated. One advantage of SentiWordNet
is that it uses a semantic resource to enhance the structure of the lexicon. Another
advantage is that it assigns both positive and negative scores to a single word.
General Inquirer5 [15] is an approach to computer-assisted text analysis. It
annotates each word as either positive or negative, together with a whole series of
4 http://mpqa.cs.pitt.edu/ (Date Last Accessed, March 29, 2016)
5 http://www.wjh.harvard.edu/~inquirer/ (Date Last Accessed, March 29, 2016)
very rich linguistic, semantic, syntactic and pragmatic information.
VADER Sentiment Lexicon6 [16] is a comprehensive list of “gold-standard”
sentiment words especially applicable to micro-blog and other social network text
data. Providing both polarity and intensity, VADER is validated by human experts.
Besides common dictionary words, it also gives information on emoticons,
slang ("nah", "meh" etc.) and acronyms ("LOL", "LMAO" etc.).
Bing Liu's lexicon7 [9] is one of the most popular sentiment lexicons for the English
language. It contains 2006 positive words and 4783 negative words. The lexicon
excels at practical tasks because it contains misspellings, slang and web-language
variants of entries.
Aside from the lexicons mentioned above, researchers tend to build
customized lexicons and tailor them to their needs. Two types of approaches
are known: dictionary-based and corpus-based.
Dictionary-based approaches make use of lexical databases like WordNet to
expand a manually created seed set. The automatic expansion explores pairwise
word relations and generates a lexicon of proper size. The first work of such
propagation is [9]. An extension of this method is [17], where the results
of propagation are further pruned and sentiment strength is assigned to each word
using probabilistic methods.
Although dictionary-based approaches can generate a large number of sentiment
words, those words are usually context- and domain-independent. Corpus-based
approaches can usually solve this kind of problem. The first such work is [18], where
linguistic connectives are utilized to determine the polarity of adjectives. The
foundation of this work is the "sentiment consistency" of natural language: people
tend to use "AND" to combine words with similar semantic orientation, e.g. "beautiful
and smart" is a legitimate English phrase while "beautiful and disgusting" is
not likely to be used in real-world language. An extension of this method is [19],
in which the author explores inter-sentential and intra-sentential sentiment
consistency. This study proved useful in generating domain-dependent sentiment
6 https://github.com/cjhutto/vaderSentiment (Date Last Accessed, March 29, 2016)
7 https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html (Date Last Accessed, March 29, 2016)
words.
In our work, we built on top of popular existing lexicons and proposed cus-
tomized scoring functions.
2.2.1.2 Lexicon-Based Classification Algorithms
The motivation behind lexicon-based classification algorithms is that the senti-
ment of a document is determined by the dominant components (words or phrases).
The basic schemes include majority voting, document scoring with thresholding and
simple word counting [20].
Lexicon-based methods usually provide a baseline for further study. Recently
there has been a trend of using ensemble learning with multiple weak lexicon-based
classifiers. Augustyniak et al. [21] use a variety of lexicon-based weak classifiers
and a C4.5 decision tree as the strong classifier. The lexicon extraction method is called
Frequentiment [20] and it proved to be 3 to 5 times faster than supervised learning.
While this is very informative and promising, no similar known work has been
conducted to test its effectiveness on English-language text.
In our work, we apply two approaches to Twitter data: Simple Word
Count and Feature Scoring. A detailed description is given in the next
chapter.
2.2.2 Machine Learning-Based Methods
Sentiment classification, by its nature, is a type of two-way text categorization
task. Text categorization usually classifies data into several pre-defined categories.
It is a well-studied field with very mature solutions and applications. The majority of
research in both text categorization and sentiment analysis falls into the machine
learning-based methodology. In this section, we will briefly overview both supervised and
unsupervised methods.
2.2.2.1 Supervised Learning Methods
Model The first work using machine learning for sentiment analysis is [22]. The
models experimented with in this work have since been widely used, namely Naive Bayes
[23], Maximum Entropy and Support Vector Machines [24, 25, 26]. Pang [27]
proposed a minimum cuts algorithm to incorporate cross-sentence constraints
and improve efficiency. Li [28] built a framework based on Conditional Random
Fields (CRFs) which is capable of employing joint features for review sentences.
Feature Ever since Pang [22], algorithms and features have been actively de-
veloped and applied in sentiment analysis. Those features include uni-gram and
n-gram term frequency, sentiment words, rules, word position, length measures
etc. [4]. Among all features, rich linguistic features have been used, such as Part-of-
Speech [24], syntactic structures [28], valence shifters [26], semantic relations [29]
etc.
2.2.2.2 Unsupervised Learning Methods
Using the dominance of sentiment words for sentiment classification starts with
Turney [7]. To determine the sentiment polarity of a document, the algorithm takes
the following steps:
1. Extract phrases using a manually-created template list.
2. Estimate the sentiment orientation of the extracted phrases using pointwise
mutual information (PMI), approximated with the assistance of a search engine.
3. Compute the sentiment orientation of a whole document and determine the
polarity with a threshold.
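Step 2 is, in Turney's formulation, the difference PMI(phrase, "excellent") − PMI(phrase, "poor"), estimated from search-engine hit counts. A minimal sketch, with the hit counts passed in as hypothetical pre-fetched values:

```python
import math

def so_pmi(hits_phrase_near_excellent, hits_phrase_near_poor,
           hits_excellent, hits_poor):
    """Semantic orientation of a phrase as the log-odds of co-occurring
    with "excellent" versus "poor"; the phrase's own hit count and the
    corpus size cancel out of the PMI difference."""
    return math.log2(
        (hits_phrase_near_excellent * hits_poor)
        / (hits_phrase_near_poor * hits_excellent)
    )

# Hypothetical counts: the phrase appears near "excellent" far more
# often than near "poor", so its orientation comes out positive.
so = so_pmi(950, 50, 10000, 10000)
assert so > 0
```

In step 3 these per-phrase orientations are averaged over the document and compared against a threshold.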
2.2.3 Rule-Based Methods
The first automatic text categorization systems relied heavily on knowledge
engineering techniques [30], where a set of human-created logical rules would be
applied. Building such an expert system is usually labor-intensive, time-consuming
and expensive.
The study of sentiment analysis emerged after text categorization became a
nearly "solved problem". Therefore most research pursues a machine learning-based
methodology. There are very few pure rule-based methods or systems that we know
of. Most rules are incorporated into lexicon-based systems to improve performance.
VADER [16] is a rule-based model with rich lexical features. It aims at sentiment
analysis in micro-blog data and achieves effective and generalizable results compared
to other state-of-the-art methods.
In our work, we have also incorporated simple linguistic rules which address
issues that lexicon-based classifiers fail to handle successfully.
It is a convention for sentiment analysis researchers to categorize methods as
“lexicon-based” and “machine learning-based”. Conceptually most of the lexicon-
based methods can be regarded as “unsupervised” or “semi-supervised” learning
methods. Taboada [31] is the most comprehensive work which uses sentiment lexicon
and incorporates intensification and negation to achieve consistent across-domain
performance.
In this work, we first focused on supervised learning methods and lexicon-based
methods. We explore the most successful models and features that have
been proven effective in the literature. We also cover linguistic rules and
features to see how they can help in this context.
3. Proposed Methods
3.1 Lexicon-Based Methods
3.1.1 Two Approaches
The basic assumption of lexicon-based methods is that the sentiment of a
document is determined by its dominant sentiment words. For simplicity, "word"
in our work may refer to either a uni-gram word or a phrase. There
are two approaches to calculating such "dominance".
Simple Word Count (SWC) Given a sentiment lexicon l and a document d =
{w_1, w_2, ..., w_n}, where w_i (1 ≤ i ≤ n) represents the i-th word in the document,
let pos(l, d) denote the number of occurrences of positive words in d and neg(l, d)
the number of occurrences of negative words in d. The overall sentiment word sum
of the document, sum(l, d), is calculated as:

sum(l, d) = pos(l, d) - neg(l, d). (3.1)
The sentiment orientation of d (1 denoting "positive" and -1 denoting "negative")
can be defined as:

s_{SWC}(l, d) =
\begin{cases}
1, & sum(l, d) > 0, \\
-1, & sum(l, d) < 0, \\
RC(d), & \text{otherwise}.
\end{cases}
(3.2)

To fit our problem, we assign a random label RC(d) to d when the sentiment word sum
is 0.
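Equations 3.1 and 3.2 amount to only a few lines of code. A minimal sketch, assuming the lexicon is a hypothetical word → ±1 mapping and RC(d) is a uniform random choice:

```python
import random

def swc_classify(lexicon, document, rng=random):
    """Simple Word Count: sum +1/-1 lexicon polarities over the
    tokens of `document` (Eq. 3.1) and take the sign (Eq. 3.2)."""
    total = sum(lexicon.get(w, 0) for w in document)  # pos - neg
    if total > 0:
        return 1
    if total < 0:
        return -1
    return rng.choice([1, -1])  # RC(d): random label on a tie

# Toy lexicon for illustration only.
lex = {"good": 1, "great": 1, "bad": -1}
label = swc_classify(lex, ["a", "good", "great", "bad", "day"])
assert label == 1  # two positive words outweigh one negative word
```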
Feature Scoring (FS) Given a sentiment lexicon l, the scoring function of a
feature f maps the feature to a real-valued number where:

score(l, f)
\begin{cases}
> 0, & \text{if positive}, \\
= 0, & \text{if neutral}, \\
< 0, & \text{if negative}.
\end{cases}
(3.3)
The scoring function not only defines the polarity (“positive” or “negative”),
but it also depicts the degree of sentiment polarity. This is based on the intuition
that sentiment features have degrees. Suppose we extract sentiment words as
features: words like "good", "great", "awesome" etc. can denote different levels of
positiveness, and words like "bad", "awful", "horrible" etc. can denote different
levels of negativeness.
Given a document d = {f_1, f_2, ..., f_n}, where f_i (1 ≤ i ≤ n) represents the i-th
feature in d, the overall sentiment sum of d can be calculated as:

sum(l, d) = \sum_{i=1}^{n} score(l, f_i). (3.4)
By selecting a threshold \delta ≈ 0, the sentiment orientation of d can be defined as:

s_{FS}(l, d) =
\begin{cases}
1, & sum(l, d) > \delta, \\
RC(d), & -\delta \le sum(l, d) \le \delta, \\
-1, & sum(l, d) < -\delta.
\end{cases}
(3.5)

To fit our problem, we assign a random label to d when the sum falls inside the
threshold interval.
Simple Word Count is a special case of Feature Scoring where:

• word is extracted as a feature,

• each positive word is scored 1.0,

• each negative word is scored -1.0,

• \delta is selected as 0.0.
For Simple Word Count method, the key is to create a lexicon with polarity
attached to each word entry. For Feature Scoring method, the key is to extract
features, define an effective scoring function and find an accurate threshold δ.
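The Feature Scoring approach (Equations 3.4 and 3.5) can be sketched the same way; the feature weights below are illustrative, not drawn from any particular lexicon. Scoring word features ±1.0 with δ = 0 reduces this to Simple Word Count:

```python
import random

def fs_classify(score, features, delta=0.0, rng=random):
    """Feature Scoring: sum real-valued feature scores (Eq. 3.4);
    sums inside the band [-delta, +delta] get a random label (Eq. 3.5)."""
    total = sum(score(f) for f in features)
    if total > delta:
        return 1
    if total < -delta:
        return -1
    return rng.choice([1, -1])  # random label inside the band

# Illustrative graded weights: "awesome" is more positive than "good".
weights = {"good": 0.5, "awesome": 0.9, "bad": -0.6}
label = fs_classify(lambda f: weights.get(f, 0.0),
                    ["awesome", "bad", "film"], delta=0.1)
assert label == 1  # 0.9 - 0.6 = 0.3 exceeds the threshold
```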
3.1.2 Basic Lexicon-Based Methods
MPQA Subjectivity Lexicon (MPQA) Here are two example MPQA entries,
for the words "abandoned" and "impassive".

Table 3.1: MPQA Example Entries

Word        Annotation
abandoned   type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
impassive   type=weaksubj len=1 word1=impassive pos1=adj stemmed1=n polarity=negative priorpolarity=weakneg
As depicted in Table 3.1, the MPQA lexicon annotates words with their type,
length, string, Part-of-Speech and other features. Among those features, we only
consider "priorpolarity" and "polarity". The former denotes the word's
context-independent polarity, while the latter denotes the word's sentiment
orientation in context.
We compiled a polarized lexicon from the MPQA lexicon. The polarity of a word
is defined by its in-context polarity when present; otherwise it is defined by its prior polarity.
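This compilation step can be sketched as follows, assuming each entry is a line of space-separated key=value pairs as in Table 3.1 (the "weakneg"/"weakpos" values are modeled on the samples shown there):

```python
def mpqa_polarity(entry_line):
    """Map one MPQA-style entry to +1/-1/0, preferring the in-context
    "polarity" field and falling back to "priorpolarity"."""
    fields = dict(kv.split("=", 1) for kv in entry_line.split() if "=" in kv)
    tag = fields.get("polarity", fields.get("priorpolarity", ""))
    if tag in ("positive", "weakpos"):
        return 1
    if tag in ("negative", "weakneg"):
        return -1
    return 0  # neutral or unrecognized

# The "abandoned" entry from Table 3.1.
entry = ("type=weaksubj len=1 word1=abandoned "
         "pos1=adj stemmed1=n priorpolarity=negative")
assert mpqa_polarity(entry) == -1
```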
General Inquirer (GI) For each entry, the General Inquirer annotates at most 186
properties, which makes it a perfect tool for rich linguistic feature extraction. In
our task, we built a lexicon using the "Positive" and "Negative" properties.
Bing Liu’s Lexicon Bing Liu’s lexicon has already categorized words into “pos-
itive” and “negative” classes. We directly copied the lexicon with small amount of
encoding conversion.
SentiWordNet (SWN) SentiWordNet provides users with clusters of synony-
mous words ready to be used in sentiment analysis tasks. Sample entries can be
found in Table 3.2.
Table 3.2: Sample SentiWordNet Entries
POS ID PosScore NegScore SynsetTerms Glossa 00019131 0.625 0 accessible#1 capable of being
reached; “a townaccessible by rail”
a 00019731 0.125 0.125 ready to hand#1handy#1
easy to reach; “founda handy spot for thecan opener”
n 15247410 0 0 ephemera#1 something transitory;lasting a day
v 02771756 0 0 run dry#1dry out#2
become empty of wa-ter; “The river runsdry in the summer”
From Table 3.2 we can see:

• SentiWordNet provides a real-valued positive score and negative score (PosScore
and NegScore) for each entry.

• SentiWordNet contains not only uni-gram words, but also multi-word expressions
(n-grams).

• SentiWordNet clusters words with similar sentiment orientation together into
different sets. For example, "run dry" and "dry out" are in the same set.
Based on our observations, such features can help determine a word's polarity,
extract n-gram features and design scoring functions.
With the real-valued PosScore and NegScore for each entry, we can determine
a word’s polarity and sentiment degree. Given a SentiWordNet entry word w, the
polarity can be determined as follows:

pol_{swn}(w) =
\begin{cases}
1, & \text{if } PosScore(w) > NegScore(w), \\
-1, & \text{if } PosScore(w) < NegScore(w), \\
0, & \text{otherwise}.
\end{cases}
(3.6)
We excluded words with a polarity of 0 from our lexicon.
We designed a simple scoring function that directly uses the scores
provided by SentiWordNet. Given a word w, the scoring function is as follows:

score_{swn}(w, l) = PosScore(w) - NegScore(w). (3.7)
Besides, we can use SentiWordNet for n-gram feature extraction. In this way,
not only can we handle sentiment words, we can also address phrases, which are
essential in expressing opinions.
Vader Sentiment Lexicon (VSL) VADER is a lexicon with both polarity and intensity information attached to each entry. Its basic structure is shown in Table 3.3.
Table 3.3: Example VADER Sentiment Lexicon

Entry       Intensity  Std.     Human Evaluation Vector
accomplish   1.8       0.6      [1, 2, 3, 2, 2, 2, 1, 1, 2, 2]
dangers     -2.2       0.87178  [-1, -1, -2, -4, -2, -3, -3, -2, -2, -2]
lmao         2.0       1.18322  [3, 0, 3, 0, 3, 1, 3, 2, 3, 2]
=]           1.6       0.8      [2, 1, 3, 1, 1, 1, 2, 3, 1, 1]
The intensity of each entry is calculated by averaging the human evaluation vector gathered from ten expert annotators. The lexicon only retains entries with a standard deviation of less than 2.5.
Based on the information we observed, the polarity of each word entry w can be determined as:

polvader(w) =
    1,  if Intensity(w) > 0,
    −1, if Intensity(w) < 0.
(3.8)
The words with an intensity of 0 have already been removed by the authors.
VADER provides uni-gram word entries as features. For use in a feature scoring algorithm, two scoring functions can be designed based on the VADER lexicon. The first uses the intensity directly as the sentiment score:

scorevader0(w, l) = Intensity(w). (3.9)

The other scoring function uses a normalized intensity:

scorevader1(w, l) = Intensity(w) / d, (3.10)

where d is the range of the intensity column of the lexicon. Given W = {w1, w2, ..., wi, ..., wn}, where 1 ≤ i ≤ n and n is the size of the VADER lexicon, d can be calculated as:

d = MAX(Intensity(W)) − MIN(Intensity(W)). (3.11)
In this way, VADER is ready to be used in lexicon-based sentiment classification tasks.
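The VADER-based scoring of equations (3.8)–(3.11) can be sketched as follows, assuming a VADER-style lexicon that maps each entry to a real-valued intensity; the dictionary below is a toy excerpt of Table 3.3, not the full lexicon.

```python
# Toy excerpt of Table 3.3: entry -> intensity (assumed layout).
intensity = {"accomplish": 1.8, "dangers": -2.2, "lmao": 2.0, "=]": 1.6}

def pol_vader(word):
    """Polarity per equation (3.8); zero-intensity entries are absent."""
    return 1 if intensity[word] > 0 else -1

# Range d of the intensity column, equation (3.11).
d = max(intensity.values()) - min(intensity.values())

def score_vader1(word):
    """Normalized score per equation (3.10)."""
    return intensity[word] / d

print(pol_vader("dangers"))  # -1
print(round(d, 1))           # 4.2
```

On the full lexicon, d would be computed once over the whole intensity column rather than over this toy excerpt.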
3.1.3 Linguistic Rules
Lexicon-based classification is a simple yet useful idea, but it fails to cover many language phenomena. To better scale lexicon-based algorithms to real-world text data, additional strategies need to be devised. A detailed linguistic and logical analysis is beyond the scope of this work. In this section, we introduce three kinds of rules, with corresponding solutions, that our experiments later proved effective.
3.1.3.1 Negation
Negation is a common device in natural language for reversing the truth value of one or several units. It is usually realized with adverbs like “not”, “never”, etc.
In the following example, “won” is supposed to be a positive sentiment word, but with “never” the overall sentiment polarity is reversed from “positive” to “negative”.
RT @coolknifeguy: Leo has never won two Oscars :(
Another example below demonstrates how the negation word “never” reverses the negativeness of “bored” and renders the whole tweet positive in sentiment.
I love @BigBang CBS Watching reruns never get bored of the big
bang theory :)
Based on our observation, we made the following assumptions:
• If a tweet contains a negation expression, the tweet entails negation.
• Negation reverses the polarity of sentiment features in the sentence.
• Negation changes the sign of the feature’s sentiment score.
To implement this rule, we first collected a set of negation expressions and their variants, as shown in Table A.1. Then we defined all non-alphanumeric, non-blank-space characters, together with end-of-file and start-of-file, as sentence delimiters. Thus we could implement the following negation inference rule:

Negation Inference Rule (NIR): For a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, we define a sliding window of size k. If wi is a negation expression, then for any sentiment word wj with |i − j| ≤ k within a sentence, the polarity and the sign of the sentiment score given any sentiment lexicon l are reversed (i.e., from “positive” to “negative” or the other way around).
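The NIR rule can be sketched in a few lines of Python; the negation set and tiny lexicon below are illustrative stand-ins for Table A.1 and the full sentiment lexicons, and sentence-delimiter handling is omitted for brevity.

```python
# Sketch of the Negation Inference Rule (NIR): sentiment scores within
# a window of k tokens around a negation expression flip sign.

NEGATIONS = {"not", "never", "no"}                  # stand-in for Table A.1
LEXICON = {"won": 1.0, "bored": -1.0, "love": 1.0}  # toy sentiment lexicon

def nir_score(tokens, k=3):
    """Sum lexicon scores, reversing any score within k tokens of a negation."""
    neg_positions = [i for i, w in enumerate(tokens) if w in NEGATIONS]
    total = 0.0
    for j, w in enumerate(tokens):
        if w not in LEXICON:
            continue
        score = LEXICON[w]
        if any(abs(i - j) <= k for i in neg_positions):
            score = -score  # NIR: reverse the sign near a negation
        total += score
    return total

print(nir_score("leo has never won two oscars".split()))  # -1.0
```

With the window k = 3, “never” reverses the positive score of “won”, matching the first example tweet above.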
3.1.3.2 Valence Shifter
A valence shifter is a device in natural language that intensifies or weakens the degree of some property of specific language units. Typical examples are “very”, “more”, “fairly”, etc.
The following example demonstrates how the valence shifter “very” intensifies the sentiment degree of “disappointing” and outweighs the positiveness of the sentiment word “loyal”. As a result, the tweet should be assigned the label “negative”.
Very disappointing how @AudiMinneapolis treats a loyal customer.
:(
From these examples, we made the following assumptions:
• If a tweet contains valence shifter expressions, the tweet entails the valence shift phenomenon.
• A valence shifter can intensify or weaken the sentiment degree of the sentiment words that follow it.
• A valence shifter does not affect the sentiment degree of words before it.
• There can be multiple valence shifter expressions, but we only use the first one and ignore the others.
To implement this rule, we first manually collected a set of valence shifter expressions, shown in Table A.1. Then we defined the same sentence delimiters as in the negation inference rule. Thus we had the following valence shifter rule:

Valence Shifter Rule (VSR): For a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, we define a sliding window of size k. If wi is the first valence shifter expression, then for any sentiment word wj with 0 ≤ j − i ≤ k within a sentence, the polarity remains the same and the sentiment score given any sentiment lexicon l is intensified to α · score(wj, l) (α ≥ 1) or weakened to β · score(wj, l) (0 ≤ β ≤ 1).
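The VSR rule can be sketched similarly; the per-word multipliers (α to intensify, β to weaken) and the tiny lexicon are illustrative choices, not values from the thesis.

```python
# Sketch of the Valence Shifter Rule (VSR): the first shifter multiplies
# scores of sentiment words in the k tokens after it. The multipliers
# (alpha >= 1 intensifies, 0 <= beta <= 1 weakens) are illustrative.

SHIFTERS = {"very": 2.0, "fairly": 0.5}          # word -> alpha or beta
LEXICON = {"disappointing": -1.0, "loyal": 1.0}  # toy sentiment lexicon

def vsr_score(tokens, k=2):
    """Apply the first valence shifter to the sentiment words after it."""
    shift_at = next((i for i, w in enumerate(tokens) if w in SHIFTERS), None)
    total = 0.0
    for j, w in enumerate(tokens):
        if w not in LEXICON:
            continue
        score = LEXICON[w]
        if shift_at is not None and 0 <= j - shift_at <= k:
            score *= SHIFTERS[tokens[shift_at]]  # VSR multiplier
        total += score
    return total

print(vsr_score("very disappointing but loyal".split()))  # -1.0
```

Here “very” doubles the negative score of “disappointing”, which then outweighs the positive “loyal” outside the window, mirroring the example tweet above.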
3.1.3.3 Contrast
Contrast is the mechanism in language that joins two or more smaller units with opposite properties into a bigger unit. The language units can be clauses, sentences, paragraphs, etc. Usually, contrast is realized with “but”, “although”, “however”, etc.
The following example tweet consists of two clauses joined by the contrasting conjunction “but”. The first clause, with sentiment words such as “impressive” and “love”, appears very positive at first glance; however, the semantic orientation is determined by the second clause (also the “main clause”).
@Sprite37 I’d rly love to play DS3 bc Bloodborne’s combat looked
so impressive, but sadly I have no PS4 :(
Another example tweet also consists of two clauses joined by the conjunction “but”. The polarity of the second clause is vague because there are no obvious sentiment words. However, with the help of the first, positive clause, we can reverse the polarity and infer that the second clause is negative. Therefore, the whole tweet should be labeled as negative.
ok marco rubio is kinda hot but he’s a republican :(
Based on our observation, we made the following assumptions:
• A tweet can consist of several clauses or sentences joined by contrasting conjunctions.
• If a tweet contains a contrasting conjunction expression, the string sequence before the conjunction is regarded as the secondary clause and the string sequence following the conjunction is regarded as the main clause.
• The overall polarity of the tweet is consistent with the main clause.
• The polarity can be determined directly from the main clause or inferred from the secondary clause by reversing its polarity.
• There can be multiple contrast conjunctions in a tweet, but we only handle the first one.
To implement this rule, we first collected a set of contrast conjunction expressions, shown in Table A.1. Then we had the following contrast inference rule:

Contrast Inference Rule (CIR): Given a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, assume that wi is the first contrast conjunction expression. Then the tweet can be divided into two clauses: c0 = {w1, w2, ..., wi−1} (the secondary clause) and c1 = {wi+1, wi+2, ..., wn} (the main clause). The sentiment polarity of t is consistent with c1 and is the reverse of c0.
In our implementation, we prioritized the main clause over the secondary clause. The algorithm is shown as follows:

Data: Tweet t with a contrast conjunction, Lexicon l
Result: Polarity label for the given tweet t
[c0, c1] ← split t by the contrast conjunction;
if score(c1, l) ≠ 0 then
    pol(t, l) ← pol(c1);   /* Use the main clause */
else if score(c0, l) ≠ 0 then
    pol(t, l) ← −pol(c0);  /* Reverse the secondary clause */
else
    pol(t, l) ← RC(t);     /* Assign a random label */
end
Algorithm 1: Contrast Inference Rule (CIR) Algorithm
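One possible Python rendering of Algorithm 1 is sketched below; the lexicon-scoring helper and conjunction list are toy stand-ins for the real lexicons and Table A.1.

```python
import random

CONJUNCTIONS = {"but", "although", "however"}  # stand-in for Table A.1
LEXICON = {"hot": 1.0, "love": 1.0, "impressive": 1.0, "sadly": -1.0}

def score(tokens):
    """Toy lexicon score: sum of per-word sentiment values."""
    return sum(LEXICON.get(w, 0.0) for w in tokens)

def sign(x):
    return (x > 0) - (x < 0)

def cir_polarity(tokens):
    """Algorithm 1: split at the first conjunction, prefer the main clause."""
    i = next((k for k, w in enumerate(tokens) if w in CONJUNCTIONS), None)
    if i is None:
        return sign(score(tokens))
    c0, c1 = tokens[:i], tokens[i + 1:]  # secondary clause, main clause
    if score(c1) != 0:
        return sign(score(c1))           # use the main clause
    if score(c0) != 0:
        return -sign(score(c0))          # reverse the secondary clause
    return random.choice([1, -1])        # assign a random label (RC)

print(cir_polarity("marco rubio is kinda hot but he is a republican".split()))  # -1
```

For the second example tweet, the main clause carries no sentiment words, so the positive secondary clause (“hot”) is reversed and the tweet is labeled negative.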
3.1.3.4 Linguistic Inference Rule
Based on our discussion in the previous parts, our linguistic rules are responsible for handling negation, valence shift and contrast in tweet text. We simplify the problem further by assuming the importance order of these rules is CIR > NIR > VSR. The analysis pipeline is shown as follows:

Data: Tweet t, Lexicon l
Result: Polarity label for t
if t entails contrast then
    classify t using the CIR rule;
else if t entails negation then
    classify t using the NIR rule;
else if t entails valence shift then
    classify t using the VSR rule;
else
    classify t using standard lexicon-based algorithms;
end
Algorithm 2: Linguistic Inference Rule (LIR) Algorithm
The LIR Algorithm simplifies the process and sidesteps some edge cases where multiple linguistic phenomena coexist in a tweet, which saves us from involved logical inference and knowledge engineering. As demonstrated by the experiments in a later chapter, these rules and algorithms helped improve the system to a certain extent.
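The priority dispatch of Algorithm 2 can be sketched as below; the trigger sets and rule stubs are hypothetical placeholders, not the thesis implementation.

```python
# Sketch of Algorithm 2 (LIR): route each tweet to the highest-priority
# applicable rule, in the order CIR > NIR > VSR.

CONTRAST = {"but", "although", "however"}
NEGATION = {"not", "never", "no"}
SHIFTERS = {"very", "fairly"}

def lir_classify(tokens, cir, nir, vsr, default):
    """Dispatch to the first rule whose trigger words appear in the tweet."""
    words = set(tokens)
    if words & CONTRAST:
        return cir(tokens)
    if words & NEGATION:
        return nir(tokens)
    if words & SHIFTERS:
        return vsr(tokens)
    return default(tokens)

# Toy usage: each rule is stubbed with a constant label.
label = lir_classify("ok but sad".split(),
                     cir=lambda t: "cir", nir=lambda t: "nir",
                     vsr=lambda t: "vsr", default=lambda t: "lex")
print(label)  # cir
```

Because only the first matching rule fires, a tweet containing both a conjunction and a negation is handled by CIR alone, which is exactly the simplification described above.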
3.2 Machine Learning-Based Methods
3.2.1 Feature
Twitter data are just sequences of string characters. To use automatic classification algorithms, a special representation must be used to make them suitable for computation. In our work, we used two types of representation: Bag-of-Words n-grams and linguistic features.
3.2.1.1 N-Grams
Bag-of-Words is one of the most successful feature representations in text categorization tasks. Under this model, text input is represented as a vector of tokens with their corresponding numeric values.
To process a tweet from raw text into a bag-of-words representation, the following steps are taken:
• Tokenize the input text from a character sequence into tokens.
• Convert the token strings to lower case.
• Remove stop words (function words like “the”, “of”, “a”, etc.) and punctuation.
• Convert tokens from strings to integer feature indexes.
• Convert feature sequences to feature vectors by a certain computation.
The computation for converting a feature sequence into a vector varies, and there are many sophisticated methods, such as presence (0 or 1), frequency (word count), IDF (Inverse Document Frequency) [32], TF-IDF [33], etc.
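The steps above can be sketched as follows; the tokenizer and stop-word list are simplified stand-ins for MALLET’s pipeline, and frequency is used as the vector value.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "you"}  # illustrative subset

def bow_vector(text, vocab):
    """Tokenize, lower-case, drop stop words, and count integer features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    vec = Counter()
    for t in tokens:
        idx = vocab.setdefault(t, len(vocab))  # map string -> feature index
        vec[idx] += 1
    return dict(vec)

vocab = {}
vec = bow_vector("No, Adele. I love you, but you're not going to make me cry today.", vocab)
print(vec[vocab["love"]])  # 1
```

Swapping the frequency count for presence, IDF or TF-IDF only changes the value stored at each feature index, not the pipeline itself.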
In the Bag-of-Words model, an n-gram refers to a slice of a longer feature sequence consisting of n contiguous tokens. N-grams were originally used in language modeling [34] by researchers interested in the probability of a word given its uses in the given documents. They are widely used in information retrieval and text mining. The most common are uni-grams, bi-grams and tri-grams, though even higher-order grams are used.
Consider the following tweet:
No, Adele. I love you, but you’re not going to make me cry today. Next!
lol
Table 3.4 shows the uni-gram, bi-gram and tri-gram feature vectors (frequency) of the tweet above, computed together with the other tweets in our data-set. For readability, we have skipped features with a value of 0. As the size of the n-gram grows, the feature space expands rapidly and each vector becomes vastly sparse.
Table 3.4: N-Gram Examples

N-gram    Feature Vector
uni-gram  love(92)=1.0, make(136)=1.0, cry(140)=1.0, adele(229)=1.0, today(362)=1.0, lol(644)=1.0
bi-gram   you re(63)=1.0, i love(218)=1.0, to make(353)=1.0, make me(354)=1.0, me cry(364)=1.0, going to(802)=1.0, adele i(976)=1.0, not going(1790)=1.0, no adele(2075)=1.0, love you(2087)=1.0, you but(7417)=1.0, but you(7418)=1.0, re not(7419)=1.0, cry today(7420)=1.0, today next(7421)=1.0, next lol(7422)=1.0
tri-gram  to make me(366)=1.0, i love you(2380)=1.0, not going to(3265)=1.0, no adele i(8957)=1.0, adele i love(8958)=1.0, love you but(8959)=1.0, you but you(8960)=1.0, but you re(8961)=1.0, you re not(8962)=1.0, re not going(8963)=1.0, going to make(8964)=1.0, make me cry(8965)=1.0, me cry today(8966)=1.0, cry today next(8967)=1.0, today next lol(8968)=1.0
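N-gram extraction itself is a simple sliding window; the feature indexes in Table 3.4 come from the whole corpus, so this sketch only lists the n-gram strings for the example tweet.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token slices, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("no adele i love you but you re not going "
          "to make me cry today next lol").split()

print(ngrams(tokens, 2)[:3])  # ['no adele', 'adele i', 'i love']
print(ngrams(tokens, 3)[0])   # no adele i
```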
In our work, we compared the effectiveness of n-grams as features for tweet sentiment classification. Our major tool for pre-processing, cleaning and computation is MALLET8 [35].
8http://mallet.cs.umass.edu/ (Date Last Accessed, March, 29 2016)
3.2.1.2 Linguistic Features
Linguistic features refer to features incorporating rich linguistic annotation, including Part-of-Speech, semantic relations, syntactic structures, etc. Such features usually rely on highly accurate taggers (and parsers). Rich linguistic features are essential for deep natural language understanding.
Given the following tweet, “my mom won’t stop calling me Justin bieber since I got my hair cut”, the Part-of-Speech tagging is shown in Figure 3.1.
Figure 3.1: Sample Part-of-Speech Tagging
For our task, we used the Stanford CoreNLP toolkit [36]. It is a highly optimized Maximum Entropy tagger with success in cross-domain natural language processing tasks. The tag set Stanford CoreNLP uses is from the Penn TreeBank9. These tags are the Parts-of-Speech of words, which denote the syntactic and semantic function of a word. For example, “NN” refers to singular nouns (“mom”, “hair”), “PRP$” refers to possessive pronouns (“my”), “VBD” refers to past-tense verbs, etc. More details on the tag set can be found on the Penn TreeBank website.
Part-of-Speech is a very important feature of natural language. It helps when two words share the same form but have totally different meanings, such as “play” (noun or verb), “book” (noun or verb), etc. With an accurate label of the linguistic role attached to a feature, the ambiguity of natural language is expected to be resolved to a great extent.
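One plausible way to turn POS tags into classifier features is to pair each token with its tag and form bi-grams over the pairs; the hand-written Penn TreeBank-style tags below are assumptions standing in for Stanford CoreNLP output.

```python
def pos_bigrams(tagged):
    """Join adjacent (word, tag) pairs into word/TAG bi-gram features."""
    units = [f"{w}/{t}" for w, t in tagged]
    return [" ".join(units[i:i + 2]) for i in range(len(units) - 1)]

# Hand-tagged prefix of the example tweet (assumed tags, not tagger output).
tagged = [("my", "PRP$"), ("mom", "NN"), ("won't", "MD"), ("stop", "VB")]
print(pos_bigrams(tagged)[0])  # my/PRP$ mom/NN
```

Because the tag travels with the word, ambiguous forms like “play” as a noun versus a verb become distinct features.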
3.2.2 Model
For our experiments, we chose three state-of-the-art models for classification tasks. Since these models are well studied, we will focus on each model’s application to sentiment classification and its configuration in our experiment. Naive Bayes (NB) is a simple classifier based on Bayes’ rule and a conditional independence assumption. It assigns the class label with the maximum conditional probability given the training set. Maximum Entropy (ME) is a highly effective classifier that iteratively searches for and optimizes feature-weight parameters to maximize the likelihood of the training set. The Support Vector Machines (SVM) model aims at finding a decision surface that maximizes the margin between two classes.
9http://www.cis.upenn.edu/~treebank/ (Date Last Accessed, March 29, 2016)
The models we chose follow the first sentiment analysis work using machine learning [22]. The toolkits we used for the machine learning implementation are MALLET (Naive Bayes and Maximum Entropy) [35] and LibSVM (Support Vector Machines) [37].
3.3 Evaluation
We used the standard metrics in text categorization to evaluate the various classifiers. Suppose we have a set of classification results over n classes, and let cij (0 ≤ i ≤ n − 1, 0 ≤ j ≤ n − 1) denote the number of instances where a document in the ith class is categorized as belonging to the jth class. The per-class measures can be calculated as follows:

Precision Assessment of what fraction of the instances assigned to a class are classified correctly:

p = cii / Σj cji. (3.12)

Recall Assessment of what fraction of the instances belonging to a class are correctly classified:

r = cii / Σj cij. (3.13)

F-measure Assessment combining both precision and recall. It helps researchers achieve a balance through the trade-off between precision and recall. The most common F-measure is the F1 measure:

F1 = 2(p × r) / (p + r). (3.14)

Accuracy Assessment of what fraction of instances are correctly classified across all classes:

a = Σi cii / Σj Σi cij. (3.15)
Based on these per-class measurements, we have two types of averaging: macro-average and micro-average.

Micro-average Create a contingency table for all classes, then compute the precision, recall and F1 measure of the whole data-set as one “big class”.

Macro-average Compute the precision, recall and F-measure for each class, then average the sums over the number of classes.

In our work, we used macro-averaged measurements to evaluate the performance of our algorithms.
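Equations (3.12)–(3.15) and macro-averaging can be sketched from a confusion matrix as follows; the 2×2 matrix is a toy example, not data from our experiments.

```python
def per_class(c, i):
    """Precision, recall and F1 for class i, equations (3.12)-(3.14)."""
    pred_i = sum(c[j][i] for j in range(len(c)))  # instances predicted as i
    true_i = sum(c[i])                            # instances actually in i
    p = c[i][i] / pred_i if pred_i else 0.0
    r = c[i][i] / true_i if true_i else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def accuracy(c):
    """Equation (3.15): correct instances over all instances."""
    return sum(c[i][i] for i in range(len(c))) / sum(map(sum, c))

def macro_average(c):
    """Average per-class precision, recall and F1 over all classes."""
    stats = [per_class(c, i) for i in range(len(c))]
    return tuple(sum(s[k] for s in stats) / len(stats) for k in range(3))

# Toy 2x2 confusion matrix: rows are true classes, columns predictions.
c = [[8, 2], [4, 6]]
print(accuracy(c))  # 0.7
```

A micro-average would instead pool all classes into one contingency table before computing the same measures.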
4. Experiment
4.1 Data-set
The data-set we created for our experiment was collected from newly posted twitter statuses (tweets) in February 2016. The web crawler is Twitter4J10, a third-party Java tool for the Twitter API.
4.1.1 Motivation for Data Gathering
Normally, a data-set for sentiment analysis is manually annotated by domain experts, researchers and linguists. However, hand-labeled data tend to be expensive and time-consuming to produce. To gather twitter data, we combined Naturally Annotated Big Data with manual cleaning. Before a detailed description, we present our motivation here.
Naturally Annotated Big Data(NADB) [38] refers to the data generated from
“natural user behavior”. For example, a user of TripAdvisor website might post a
status saying:
I like Tokyo, Beijing, Shanghai and other cities.
By analyzing this sentence, it is very easy for computer programs to extract a certain “is-kind-of” relation: Beijing, Shanghai and Tokyo are cities. Such phenomena are ubiquitous in web-pages, blogs, tweets and other kinds of textual data.
The natural annotation we used in our data gathering is the emoticon11. Emoticons are tokens representing facial expressions using punctuation marks and alphanumeric characters. Our assumption is that users use “happy” emoticons to express positive sentiment and “sad” emoticons to express negative sentiment. Only in very rare cases would the opposite happen.
10http://twitter4j.org/en/index.html (Date Last Accessed, March 21, 2016)11The full list of Twitter emoticons can be found at: http://emojipedia.org/twitter/ (Date Last
Accessed, March 10, 2016)
Another assumption is that if a user mentions a key word in a tweet, the tweet is about the topic represented by that keyword. We chose 10 topic lists containing popular entity words/phrases in the hope that these topic key words can help gather topic-specific data via the query function provided by the Twitter API.
4.1.2 Raw Data
We first selected a list of several “positive” emoticons and “negative” emoticons, as shown in Table 4.1. The meanings of these emoticons are deterministic, with the least possible ambiguity.
Table 4.1: Positive and Negative Emoticons

Polarity  Emoticons
Positive  :), ;), :D, :-), :-D
Negative  :(, :-(, :'(, :'-(, D:
Second, we collected nine lists of key words pertaining to nine topics. These key words are names of entities (celebrities, commercial brands, titles of a movie or a TV show, etc.) that have been widely discussed either by mass media or by social network users. A detailed description of the topics, with examples, is provided in Table 4.2.
After the emoticon list and topic lists were prepared, we were ready to crawl topic-related tweets using the Twitter API. For each key word, a query of “key word + emoticon” returns a collection of results belonging to the specific topic with a certain sentiment polarity. For example, a query of “Taylor Swift” with “:)” returns a collection of tweets under the topic “artist” with polarity “positive”, while a query of “AngularJS” with “:(” returns a collection of tweets under the topic “technology” with polarity “negative”. We iteratively went through all nine topic lists and crawled a positive tweet set and a negative tweet set for each topic.
To conclude the data-crawling process, we collected another two tweet sets using only positive or negative emoticon string literals as queries. In this way, we collected data for a general topic with no specific domain.
After these steps, we had successfully built a raw data pool with positive and negative twitter data for ten topics (including one general-domain topic).
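The “key word + emoticon” queries can be generated mechanically; the thesis issues them through Twitter4J (Java), while this Python stand-in only constructs the query strings and their implied labels.

```python
POSITIVE = [":)", ";)", ":D", ":-)", ":-D"]    # Table 4.1, positive
NEGATIVE = [":(", ":-(", ":'(", ":'-(", "D:"]  # Table 4.1, negative

def build_queries(keyword):
    """Pair a topic keyword with each emoticon and its implied polarity."""
    queries = [(f"{keyword} {e}", "positive") for e in POSITIVE]
    queries += [(f"{keyword} {e}", "negative") for e in NEGATIVE]
    return queries

print(build_queries("Taylor Swift")[0])  # ('Taylor Swift :)', 'positive')
```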
Table 4.2: Topic Key Word(s)

#  Topic       Description                                         Examples
1  Artist      Names of popular musicians or actors                Taylor Swift, Lady Gaga, Alessia Cara
2  Automobile  Brand names of popular cars                         Aston Martin, Audi, BMW, Buick
3  Game        Names of popular games on all platforms             Batman: Arkham Knight, Halo 5: Guardians
4  IT Company  Names of famous IT companies                        Oracle, SAP, Fujitsu, Accenture
5  Movie       Names of popular movies                             Mad Max: Fury Road, Jurassic World, Furious 7
6  Politician  Names of 2016 presidential candidates               Hillary Clinton, Donald Trump, Ted Cruz
7  Software    Names of popular software across all platforms      Yik Yak, Instagram, Zillow, Fitbit
8  Technology  Names of popular software engineering technologies  AngularJS, Java Spring, MeteorJS, CakePHP
9  TV Show     Names of popular TV shows on Netflix.com            Game of Thrones, Grey’s Anatomy, Vikings
4.1.3 Cleaning Data
Based on the raw data collected as described in the previous section, our data-cleaning process is as follows:
1. Remove non-English tweets.
2. Remove blank symbols (new lines, spaces, tabs, etc.).
3. Remove Unicode characters.
4. Remove tiny links12, retweet key words (“RT”) and usernames (“@username”) generated by the Twitter system.
5. Remove the emoticons used in the data-crawling stage.
6. Randomly select 1,000 positive tweets and 1,000 negative tweets for each topic.
12https://support.twitter.com/articles/78124 (Date Last Accessed, March 10, 2016)
The data-set13 consists of ten topics, each with 1,000 positive and 1,000 negative records. This finalizes our data preparation for the experiments.
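Steps 4 and 5 of the cleaning process can be approximated with a few regular expressions; the patterns below are simplified stand-ins for the actual pipeline, not the thesis code.

```python
import re

# Emoticons from Table 4.1, longest first so ":-(" is stripped before ":(".
QUERY_EMOTICONS = [":'-(", ":-)", ":-D", ":-(", ":'(", ":)", ";)", ":D", ":(", "D:"]

def clean_tweet(text):
    """Strip tiny links, the RT marker, @usernames and crawl emoticons."""
    text = re.sub(r"https?://t\.co/\S+", "", text)  # tiny links
    text = re.sub(r"\bRT\b", "", text)              # retweet key word
    text = re.sub(r"@\w+", "", text)                # usernames
    for e in QUERY_EMOTICONS:
        text = text.replace(e, "")
    return " ".join(text.split())                   # collapse whitespace

print(clean_tweet("RT @user I love it https://t.co/abc :)"))  # I love it
```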
4.2 Setting
In the experiment, we explored the three types of methods introduced in the previous chapter: lexicon-based methods, rule-based methods and machine learning-based methods. This section introduces the order of the experiments and their detailed configurations.
Baseline methods We used two methods as our general baseline against which to measure improvements of our system.
1. Random Classifier (RC) Given a tweet, a random class label from the label set is assigned.
2. Most Frequent Classifier (MFC) Given a tweet, the class label with maximum occurrence in the training corpus is assigned.
Lexicon-Based Methods The lexicon-based methods we experimented with combine our sentiment lexicons with simple word count or feature scoring approaches:
1. MPQA-SWC MPQA lexicon with Simple Word Count approach.
2. GI-SWC General Inquirer lexicon with Simple Word Count approach.
3. BL-SWC Bing Liu’s lexicon with Simple Word Count approach.
4. SWN-SWC SentiWordNet lexicon with Simple Word Count approach.
5. VSL-SWC Vader Sentiment Lexicon with Simple Word Count approach.
6. SWN-UFS SentiWordNet lexicon with Uni-gram feature scoring function in
equation 3.7.
7. SWN-BFS SentiWordNet lexicon with bi-gram feature scoring function in
equation 3.714.
13The data can be downloaded from: http://homepages.rpi.edu/~yuanb/thesis/thesis.html(Date Last Accessed, March 29, 2016)
14In n-gram, we included uni-gram to n-gram features.
8. SWN-TFS SentiWordNet lexicon with tri-gram Feature Scoring function in
equation 3.7.
9. VSL-BFS Vader Lexicon with basic feature scoring function in equation 3.9.
10. VSL-NFS Vader Sentiment Lexicon with normalized feature scoring function
in equation 3.10.
Rule-Based Methods We incorporated our Linguistic Inference Rule method from Algorithm 2 with the top three lexicon-based methods.
1. BL-SWC-LIR Bing Liu’s lexicon with Simple Word Count approach with
LIR algorithm.
2. VSL-NFS-LIR Vader Lexicon with normalized featuring scoring function
and LIR algorithm.
3. VSL-BFS-LIR Vader Lexicon with basic featuring scoring function and LIR
algorithm.
Machine Learning-Based Methods We incorporated two sets of features into three models: N-gram bag-of-words (BOW) features and deeper linguistic features. The N-gram BOW methods include:
1. NB-NGRAM Naive Bayes classifier with uni-gram to 8-gram BOW features.
2. ME-NGRAM Maximum Entropy classifier with uni-gram to 8-gram BOW
features.
3. SVM-NGRAM Support Vector Machines classifier with uni-gram to 8-gram
BOW features.
For the linguistic features, we chose bi-grams15 combined with Part-of-Speech (POS) features, in the following methods.
1. NB-POS Naive Bayes classifier with bi-gram Part-of-Speech features.
15This is because bi-gram performs best in BOW experiment stage.
2. ME-POS Maximum Entropy classifier with bi-gram Part-of-Speech features.
3. SVM-POS Support Vector Machines classifier with bi-gram Part-of-Speech
features.
5. Discussion
5.1 Baseline
The results of the two baseline algorithms are shown in Figures 5.1a and 5.1b.
(a) Random Classifier(RC) (b) Most Frequent Classifier(MFC)
Figure 5.1: Results of Baseline Algorithms
RC achieved results of around 0.5000 across the ten domains in all measurements. This matches our expectation, because the data-set is balanced across classes and domains. MFC achieved accuracy and recall comparable to RC; its recall is always 0.5000. However, its precision is very high (around 0.7500) while its F1 value is low (around 0.3500), which follows from the trade-off between the two measurements. The average performance of the two algorithms across all domains is shown in Table 5.1.
Table 5.1: Average Performance of Baseline Algorithms

Classifier  Accuracy  Precision  Recall  F1
RC          0.5021    0.5021     0.5021  0.50203
MFC         0.5008    0.7504     0.5     0.33365
5.2 Lexicon-Based Methods
The results of the ten lexicon-based methods are shown in Figure 5.2. The ten lexicon-based classifiers demonstrated uneven classification capability across the ten topics. In overall accuracy, we can expect somewhat higher performance than the baselines.
However, in terms of the F1 measurement, which is another comprehensive evaluation measurement, the results are mixed.
Table 5.2 shows the best performance for each classifier and Table 5.3 shows the average performance. For accuracy, the best results range from 0.5700 to 0.6780, higher than the 0.5250 best baseline accuracy achieved by RC. Average accuracy is generally between 0.5060 and 0.5605. More optimistic results can be expected in terms of precision, where the best performance exceeds 0.6269 and average precision is also likely to achieve 0.6045. These two measurements prove a generally adequate capability of lexicon-based classifiers to predict “correctly” a certain portion of opinionated tweets.
In terms of recall and F1 value, lexicon-based classifiers vary considerably. While recall values vacillate from 0.5038 to 0.6489, slightly higher than the baseline, the overall F1 value can be as low as 0.3895, which is worse than the baseline, or as high as 0.6714, which is satisfactory in certain cases. This reveals that although lexicon-based classifiers can generally increase “correctness”, their ability to “find all” and to “win favour on all sides” is unpredictable.
Table 5.2: Best Performance of Lexicon-Based Methods Across Domains

Algorithm  Accuracy  Macro-Precision  Macro-Recall  Macro-F1
MPQA-SWC   0.644     0.6668           0.6433        0.6307
GI-SWC     0.627     0.6269           0.6268        0.6268
BL-SWC     0.678     0.6953           0.6789        0.6714
VSL-SWC    0.625     0.6352           0.6253        0.6249
SWN-SWC    0.57      0.6535           0.5662        0.4965
SWN-UFS    0.567     0.6964           0.5916        0.513
SWN-BFS    0.588     0.6868           0.5916        0.5297
SWN-TFS    0.582     0.6735           0.5805        0.5165
VSL-BFS    0.644     0.6489           0.6442        0.644
VSL-NFS    0.647     0.647            0.6471        0.647
In terms of lexicons, Bing Liu’s sentiment lexicon (BL-SWC) and the Vader Sentiment Lexicon (VSL-BFS and VSL-NFS) give the top three most effective lexicon-based methods. The former uses the Simple Word Count approach and the latter the Feature Scoring approach. A possible explanation for their performance is that Bing Liu’s lexicon is specially compiled from Internet corpora and the Vader Sentiment Lexicon is also tailored for sentiment analysis over social network data. These lexicons require the least effort for domain adaptation and are likely to cover more occurrences of real-world features in Internet language.

(a) MPQA-SWC (b) GI-SWC
(c) BL-SWC (d) SWN-SWC
(e) VSL-SWC (f) SWN-UFS
(g) SWN-BFS (h) SWN-TFS
(i) VSL-BFS (j) VSL-NFS
Figure 5.2: Results of Lexicon-Based Algorithms

Table 5.3: Average Performance of Lexicon-Based Methods Across Domains

Algorithm  Accuracy  Macro-Precision  Macro-Recall  Macro-F1
MPQA-SWC   0.5095    0.50903          0.51032       0.48397
GI-SWC     0.5427    0.54608          0.54402       0.53706
BL-SWC     0.593     0.60451          0.59403       0.58315
VSL-SWC    0.5524    0.55426          0.552         0.54238
SWN-SWC    0.5061    0.51134          0.50383       0.3895
SWN-UFS    0.5064    0.52256          0.50792       0.39377
SWN-BFS    0.506     0.51201          0.50698       0.39186
SWN-TFS    0.5117    0.52248          0.50682       0.39473
VSL-BFS    0.5536    0.55782          0.55471       0.54536
VSL-NFS    0.5605    0.56332          0.55973       0.55079
5.3 Rule-based Methods
The results of the lexicon-based methods with the Linguistic Inference Rule (LIR) algorithm are shown in Figure 5.3. We chose the top three lexicon-based methods: BL-SWC, VSL-BFS and VSL-NFS.
As shown in Figures 5.5 and 5.4, the LIR algorithm can boost performance to a certain extent. In terms of average evaluation, VSL-BFS and VSL-NFS both show a certain degree of increase in all measurements. In terms of best performance, BL-SWC and VSL-BFS both improve in all measurements, and the precision of VSL-BFS increased by 1%. However, we also see that for BL-SWC, none of the average measurements increased with LIR, and for VSL-BFS, LIR achieved only a better precision, while accuracy, recall and F1 remained comparable with the baseline. One noteworthy result is that for both the VSL-NFS and BL-SWC methods, the best precision increased and exceeded 0.7000.
The performance of the LIR algorithm depends on both the data composition and the corresponding lexicon-based method. From our experiment, we can infer that rule-based methods can help increase precision and accuracy, but in terms of overall performance, their efficiency still needs more examination.
(a) BL-SWC-LIR (b) VSL-NFS-LIR
(c) VSL-BFS-LIR
Figure 5.3: Results of Rule-Based Methods
(a) BL-SWC with/without LIR (b) VSL-BFS with/without LIR
(c) VSL-NFS with/without LIR
Figure 5.4: Comparison of Best Performance with LIR Algorithm
(a) BL-SWC with/without LIR (b) VSL-BFS with/without LIR
(c) VSL-NFS with/without LIR
Figure 5.5: Comparison of Average Performance with LIR Algorithm
5.4 Machine Learning-Based Methods
We used three state-of-the-art classifiers, namely Naive Bayes (NB), Maximum Entropy (ME) and Support Vector Machines (SVM), together with two sets of features.
The results of the machine learning-based classifiers incorporating N-gram Bag-of-Words features, with N ranging from 1 (uni-gram) to 8, are shown by domain in Figures 5.6, 5.7 and 5.8.
Generally, the machine learning classifiers achieved very encouraging results in evaluation. All four measurements are very high compared to the lexicon-based and rule-based classifiers. Naive Bayes is one of the simplest classifiers, yet it achieved 0.8589−0.8774 in average accuracy, 0.8605−0.8798 in precision, 0.8588−0.8774 in recall and 0.8586−0.8771 in F1 value. The best performance of the NB classifier reached over 0.9500 in all measurements. Slightly higher results could be expected for Maximum Entropy: while the best performance of the ME classifier was roughly the same as NB’s, its average performance in all measurements was about 1% higher. SVM was the best classifier in overall performance. Its average measurements reached 0.8600−0.8900 in general and its best
(a) NB-ART-BOW (b) NB-AUT-BOW
(c) NB-GAM-BOW (d) NB-GEN-BOW
(e) NB-ITC-BOW (f) NB-MOV-BOW
(g) NB-POL-BOW (h) NB-SOF-BOW
(i) NB-TEC-BOW (j) NB-TVS-BOW
Figure 5.6: Naive Bayes with N-Gram Bag-of-Words Features
(a) ME-ART-BOW (b) ME-AUT-BOW
(c) ME-GAM-BOW (d) ME-GEN-BOW
(e) ME-ITC-BOW (f) ME-MOV-BOW
(g) ME-POL-BOW (h) ME-SOF-BOW
(i) ME-TEC-BOW (j) ME-TVS-BOW
Figure 5.7: Maximum Entropy with N-Gram Bag-of-Words Features
(a) SVM-ART-BOW (b) SVM-AUT-BOW
(c) SVM-GAM-BOW (d) SVM-GEN-BOW
(e) SVM-ITC-BOW (f) SVM-MOV-BOW
(g) SVM-POL-BOW (h) SVM-SOF-BOW
(i) SVM-TEC-BOW (j) SVM-TVS-BOW
Figure 5.8: Support Vector Machines with N-Gram Bag-of-Words Features
(a) NB-BOW-AVG (b) ME-BOW-AVG
(c) SVM-BOW-AVG
Figure 5.9: Average Performance of N-Gram Bag-of-Words Features
measurements all exceeded 0.9600.
Our Bag-of-Words features range from uni-gram to 8-gram. Based on our observations, most data sets reach their best performance with bi-grams, as depicted in Figures 5.6 to 5.8. For the NB classifier, 6 out of 10 topics favor bi-grams over the alternatives; for ME and SVM, the numbers are 5 and 8, respectively. In terms of average performance, bi-grams clearly dominate all other n-gram features, as depicted in Figure 5.9.
From our experiments, we conclude that uni-gram features are generally effective and that bi-grams are the most effective BOW features for multi-domain Twitter sentiment analysis. Moreover, SVM performed best in terms of all common measurements.
To experiment further with the machine learning-based classifiers, we incorporated a rich linguistic feature: Part-of-Speech (POS) tags. For simplicity, we only conducted experiments with bi-gram features. The results are shown in Figure 5.10. Compared to Bag-of-Words features alone, the results improved for all three classifiers, as depicted in Figure 5.11: across all domains, all four average measurements increased by approximately 0.0100 to 0.0200. For Twitter data, however, POS tagging can be both inefficient and error-prone, and the improvement is relatively small considering how time-consuming the tagging is.
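One common way to fold POS information into a bag-of-words model is to emit tagged tokens alongside plain words. A real system would use a tagger such as Stanford CoreNLP [36]; the toy lookup table and the `word_TAG` feature format below are assumptions made only to keep the sketch self-contained.

```python
# Sketch of POS-augmented features. TOY_TAGS stands in for a real
# POS tagger (e.g. Stanford CoreNLP), which would be slower and can
# mis-tag noisy Twitter text -- the trade-off discussed above.
TOY_TAGS = {
    "this": "DT", "movie": "NN", "is": "VB", "great": "JJ",
    "terrible": "JJ", "game": "NN", "love": "VB", "a": "DT",
}

def pos_features(tweet):
    """Emit plain word tokens followed by word_TAG tokens ('great_JJ')."""
    tokens = tweet.lower().split()
    feats = list(tokens)
    for tok in tokens:
        feats.append(f"{tok}_{TOY_TAGS.get(tok, 'UNK')}")
    return feats

print(pos_features("this movie is great"))
# -> ['this', 'movie', 'is', 'great',
#     'this_DT', 'movie_NN', 'is_VB', 'great_JJ']
```

The augmented token list can then be fed to the same vectorizer and classifiers as the plain Bag-of-Words features.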
Figure 5.10: Average Performance of Machine Learning Classifiers with Linguistic Features
(panels (a)–(c): NB, ME and SVM with POS features)
Figure 5.11: Comparison of Linguistic and Bag-of-Words Features
(panels (a)–(c): NB, ME and SVM with POS and BOW features)
5.5 Evaluation Revisited
From our discussion, it appears that machine learning-based methods far outperform lexicon-based and rule-based methods in almost all evaluation measurements. Even the simplest machine learning model can achieve a score 35% higher than a fine-tuned lexicon-based or rule-based classifier. However, a closer examination of our problem raises a new question: is a classifier with high accuracy really accurate?
For classification problems with relatively "strict" theoretical grounding and boundaries, such as text categorization, protein functional categorization and face detection, it is true that higher accuracy means a better system. However, sentiment analysis entails a large amount of subjectivity. In practice, it is hard to quantify the intensity of an emotion or opinion. For single words like "good" and "excellent", we can conclude that the former is not as "strong" as the latter. But when it comes to "terrible", "dreadful" and "horrible", the distinctions are too vague to draw easily.
Further, a study by social science researchers reveals that the complexity of this problem hinges on many factors16. In this study, the author argues that evaluation measurements merely record the percentage of times that human judgment agrees with the system. The deeper issue is human concordance, i.e., the level of agreement among human annotators. The author cites studies by commercial companies indicating that in sentiment analysis human concordance is roughly 70% to 79%. Given this fact, a system could achieve a perfect 100% accuracy against its annotated test set while still disagreeing with a random human individual up to 30% of the time.
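This concordance ceiling can be illustrated with a toy simulation. The 85% per-annotator agreement rate below is an assumed parameter, not a figure from the cited study; the point is that a system reproducing annotator A perfectly still disagrees with annotator B exactly as often as the two humans do.

```python
# Toy simulation of the human-concordance ceiling. The 0.85 agreement
# rate is an illustrative assumption, chosen so that the expected
# pairwise human concordance is 0.85^2 + 0.15^2 = 0.745, i.e. in the
# 70-79% range cited above.
import random

random.seed(0)
truth = [random.choice(("pos", "neg")) for _ in range(10000)]

def annotate(labels, agreement):
    """Each annotator flips a label with probability (1 - agreement)."""
    return [l if random.random() < agreement
            else ("neg" if l == "pos" else "pos") for l in labels]

ann_a = annotate(truth, 0.85)
ann_b = annotate(truth, 0.85)
system = list(ann_a)  # the system matches annotator A on every tweet

acc_vs_a = sum(s == a for s, a in zip(system, ann_a)) / len(truth)
acc_vs_b = sum(s == b for s, b in zip(system, ann_b)) / len(truth)
print(acc_vs_a)  # 1.0: "perfect" accuracy against the test annotator
print(acc_vs_b)  # typically around 0.74-0.75 against a second annotator
```

The "100% accurate" system thus disagrees with a second human roughly a quarter of the time, purely because the humans disagree with each other.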
In summary, using precision, recall, F1 value and accuracy to measure sentiment analysis is somewhat a matter of expediency. The ultimate goal of sentiment analysis is to endow computers with the ability to "feel" emotion and act with sentiment like humans. If humans themselves cannot identify the sentiment in natural language 100% correctly, what should we expect from computers?
We discuss these problems here in the hope that these thoughts shed light on interesting aspects of sentiment analysis. Further efforts should be made to scrutinize such issues more closely.
16http://brnrd.me/social-sentiment-sentiment-analysis/ (Date Last Accessed, March 29, 2016)
6. Conclusion
In this work, we explored three mainstream methodologies for sentiment analysis
over Twitter data, namely lexicon-based methods, rule-based methods and machine
learning-based methods.
Our major contributions are threefold. First, we extensively studied popular sentiment lexicons and applied them with both the Simple Word Count and the Feature Scoring approaches; Bing Liu's Lexicon and the Vader Sentiment Lexicon proved effective for Twitter sentiment analysis. Second, we proposed a set of Linguistic Inference Rules that help handle negation, valence shifters and contrast in natural language text; our LIR rules improve the precision and accuracy of Twitter sentiment analysis. Last but not least, we compared two sets of features, Bag-of-Words N-grams and linguistic features, with state-of-the-art machine learning classifiers. The Bag-of-Words features are simple yet effective, and the bi-gram BOW feature achieved the best performance with all three models; linguistic features improved performance only by a slight margin.
Two problems drew our attention. First, whether it is legitimate to evaluate sentiment classification using precision, recall, F1 value and accuracy. Second, whether applying rich linguistic features to sentiment analysis is worth the time and effort, considering that the improvement is not very significant.
In the future, we intend to study many related problems further. On the one hand, we would like to compare Twitter sentiment analysis with other domains; effective unsupervised or lexicon-based classifiers, domain adaptability and feature selection are all relevant topics that need further research. On the other hand, given an efficient sentiment analysis algorithm, we would like to see how it can be applied to real-world problems, for example predicting presidential elections or estimating product reputations and movie ratings. Furthermore, we would also like to dive into the engineering aspects of Twitter sentiment analysis: optimization and scalable algorithms for big data are issues that need to be solved in the not-so-distant future.
LITERATURE CITED
[1] S. Muralidharan, L. Rasmussen, D. Patterson, and J.-H. Shin, “Hope for
haiti: an analysis of facebook and twitter usage during the earthquake relief
efforts,” Public Relations Rev., vol. 37, no. 2, pp. 175–177, Jun. 2011.
[2] D. L. Cogburn and F. K. Espinoza-Vasquez, “From networked nominee to
networked nation: examining the impact of web 2.0 and social media on
political participation and civic engagement in the 2008 obama campaign,” J.
Political Marketing, vol. 10, no. 1-2, pp. 189–213, Feb. 2011.
[3] Merriam-Webster, Merriam-Webster’s Collegiate Dictionary. Springfield,
MA: Merriam-Webster, 2004.
[4] B. Liu, “Sentiment analysis and opinion mining,” Synthesis Lectures on
Human Lang. Tech., vol. 5, no. 1, pp. 1–167, Apr. 2012.
[5] A. Farzindar and D. Inkpen, “Natural language processing for social media,”
Synthesis Lectures on Human Lang. Tech., vol. 8, no. 2, pp. 1–166, Sept. 2015.
[6] C. Strapparava and R. Mihalcea, “Learning to identify emotions in text,” in
Proc. the 2008 ACM Symp. Appl. Comput., Ceará, Brazil, 2008, pp. 1556–1560.
[7] P. D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to
unsupervised classification of reviews,” in Proc. 40th Annu. Meeting on Assoc.
for Computational Linguistics, Philadelphia, PA, 2002, pp. 417–424.
[8] B. Yang and C. Cardie, “Context-aware learning for sentence-level sentiment
analysis with posterior regularization.” in Proc. 52nd Annu. Meeting on
Assoc. for Computational Linguistics, Baltimore, MD, 2014, pp. 325–335.
[9] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proc.
10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining,
Seattle, WA, 2004, pp. 168–177.
[10] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in
phrase-level sentiment analysis,” in Proc. Conf. Human Lang. Tech. and
Empirical Methods in Natural Lang. Process., Vancouver, B.C., Canada, 2005,
pp. 347–354.
[11] T. Wilson, J. Wiebe, and R. Hwa, “Just how mad are you? finding strong and
weak opinion clauses," in Proc. 19th Nat. Conf. Artificial Intell., vol. 4, San
Jose, CA, 2004, pp. 761–769.
[12] Z. Zhang, D. Miao, and B. Yuan, “Context-dependent sentiment classification
using antonym pairs and double expansion,” in Web-Age Inform. Manage.,
Macau, China, 2014, pp. 711–722.
[13] N. Jindal and B. Liu, “Mining comparative sentences and relations,” in Proc.
21st Nat. Conf. Artificial Intell., Boston, MA, 2006, pp. 1331–1336.
[14] A. Esuli and F. Sebastiani, “Sentiwordnet: a publicly available lexical resource
for opinion mining,” in Proc. Lang. Resources and Evaluation Conf., Genoa,
Italy, 2006, pp. 417–422.
[15] P. J. Stone, D. C. Dunphy, and M. S. Smith, “The general inquirer: a
computer approach to content analysis.” in Proc. Spring Joint Comput. Conf.,
New York, NY, 1966, pp. 241–256.
[16] C. Hutto and E. Gilbert, “A parsimonious rule-based model for sentiment
analysis of social media text,” in 8th Int. Conf. Weblogs and Social Media,
Ann Arbor, MI, 2014, pp. 216–225.
[17] S.-M. Kim and E. Hovy, “Identifying and analyzing judgment opinions,” in
Proc. Main Conf. on Human Lang. Tech. Conf. North Amer. Chapter of the
Assoc. of Computational Linguistics, New York, NY, 2006, pp. 200–207.
[18] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic orientation
of adjectives,” in Proc. 35th Assoc. Computational Linguistics and 8th Conf.
European Chapter of the Assoc. Computational Linguistics, Madrid, Spain,
1997, pp. 174–181.
[19] H. Kanayama and T. Nasukawa, “Fully automatic lexicon expansion for
domain-oriented sentiment analysis,” in Proc. 2006 Conf. on Empirical
Methods in Natural Lang. Process., Sydney, Australia, 2006, pp. 355–363.
[20] L. Augustyniak, P. Szymanski, T. Kajdanowicz, and W. Tuliglowicz,
“Comprehensive study on lexicon-based ensemble classification sentiment
analysis,” Entropy, vol. 18, no. 1, p. 4, Dec. 2015.
[21] L. Augustyniak, T. Kajdanowicz, P. Szymanski, W. Tuliglowicz, P. Kazienko,
R. Alhajj, and B. Szymanski, “Simpler is better? lexicon-based ensemble
sentiment classification beats supervised methods,” in Proc. IEEE/ACM Int.
Conf. Advances in Social Network Anal. and Mining, Beijing, China, 2014,
pp. 924–929.
[22] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification
using machine learning techniques,” in Proc. ACL Conf. on Empirical
Methods in Natural Lang. Process., vol. 10, Philadelphia, PA, 2002, pp. 79–86.
[23] S. Tan, X. Cheng, Y. Wang, and H. Xu, “Adapting naive bayes to domain
adaptation for sentiment analysis,” in Adv. in Inform. Retrieval, Toulouse,
France, 2009, pp. 337–349.
[24] M. Gamon, “Sentiment classification on customer feedback data: noisy data,
large feature vectors, and the role of linguistic analysis,” in Proc. 20th Int.
Conf. on Computational Linguistics, Barcelona, Spain, 2004, pp. 841–847.
[25] T. Mullen and N. Collier, “Sentiment analysis using support vector machines
with diverse information sources.” in Proc. Empirical Methods in Natural
Lang. Process., Barcelona, Spain, 2004, pp. 412–418.
[26] S. Li, S. Y. M. Lee, Y. Chen, C.-R. Huang, and G. Zhou, “Sentiment
classification and polarity shifting,” in Proc. 23rd Int. Conf. on
Computational Linguistics, Uppsala, Sweden, 2010, pp. 635–643.
[27] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts,” in Proc. 42nd Annu.
Meeting on Assoc. for Computational Linguistics, Barcelona, Spain, 2004, p.
271.
[28] F. Li, C. Han, M. Huang, X. Zhu, Y.-J. Xia, S. Zhang, and H. Yu,
“Structure-aware review mining and summarization,” in Proc. 23rd Int. Conf.
Computational Linguistics, Uppsala, Sweden, 2010, pp. 653–661.
[29] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts,
“Learning word vectors for sentiment analysis,” in Proc. 49th Annu. Meeting
of Assoc. for Computational Linguistics: Human Lang. Tech., Portland, OR,
2011, pp. 142–150.
[30] F. Sebastiani, “Machine learning in automated text categorization,” ACM
Comput. Surveys, vol. 34, no. 1, pp. 1–47, Mar. 2002.
[31] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-based
methods for sentiment analysis,” Computational Linguistics, vol. 37, no. 2, pp.
267–307, Sept. 2011.
[32] J. Allen, Natural Language Understanding. Upper Saddle River, NJ:
Pearson, 1987.
[33] J. Ramos, “Using tf-idf to determine word relevance in document queries,” in
Proc. 1st Instructional Conf. Mach. Learn., Washington D.C., 2003.
[34] D. Jurafsky, Speech & Language Processing. Upper Saddle River, NJ:
Prentice Hall, 2008.
[35] A. K. McCallum, “Mallet: A machine learning for language toolkit,” 2002.
[Online]. Available: http://mallet.cs.umass.edu (Date Last Accessed: March
29, 2016)
[36] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and
D. McClosky, “The stanford corenlp natural language processing toolkit.” in
Annu. Meeting on Assoc. for Computational Linguistics Syst.
Demonstrations, Baltimore, MD, 2014, pp. 55–60.
[37] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,”
ACM Trans. Intelligent Syst. and Tech., vol. 2, no. 3, p. 27, Apr. 2011.
[38] M. Sun, Z. Liu, M. Zhang, and Y. Liu, Chinese Computational Linguistics
and Natural Lang. Process. Based on Naturally Annotated Big Data. Berlin,
Germany: Springer, 2015.
APPENDIX A
Linguistic Resources
To address linguistic phenomena that lexicon-based classifiers fail to handle, we created a series of rules. In this appendix, we present some of the linguistic resources that we collected. The negation and valence shifter expressions are taken from the Vader [16] source code17; the contrasting conjunctions shown were collected manually.
Table A.1: Valence Shifter Expressions

Negation: aint, arent, cannot, cant, couldnt, darent, didnt, doesnt, ain't, aren't, can't, couldn't, daren't, didn't, doesn't, dont, hadnt, hasnt, havent, isnt, mightnt, mustnt, neither, don't, hadn't, hasn't, haven't, isn't, mightn't, mustn't, neednt, needn't, never, none, nope, nor, not, nothing, nowhere, oughtnt, shant, shouldnt, uhuh, wasnt, werent, oughtn't, shan't, shouldn't, uh-uh, wasn't, weren't, without, wont, wouldnt, won't, wouldn't, rarely, seldom, despite

Intensifying Shifters: absolutely, amazingly, awfully, completely, considerably, decidedly, deeply, effing, enormously, entirely, especially, exceptionally, extremely, fabulously, flipping, flippin, frickin, frigging, friggin, fully, fucking, greatly, hella, highly, hugely, incredibly, intensely, majorly, more, most, particularly, purely, quite, really, remarkably, so, substantially, thoroughly, totally, tremendously, uber, unbelievably, unusually, utterly, very

Weakening Shifters: almost, barely, hardly, just enough, kind of, kinda, kindof, kind-of, less, little, marginally, occasionally, partly, scarcely, slightly, somewhat, sort of, sorta, sortof, sort-of

Contrasting Conjunctions: but, although, though, even though, even if, however
17https://github.com/cjhutto/vaderSentiment (Date Last Accessed, March 29, 2016)
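A minimal sketch of how the expressions in Table A.1 can adjust a lexicon score. The toy lexicon, the three-word look-back window and the ×1.3/×0.7 shifter multipliers below are illustrative assumptions, not the thesis's actual LIR parameters.

```python
# Hedged sketch of lexicon scoring with the Table A.1 expression
# classes. The word scores, window size and multipliers are toy
# values chosen only to show the mechanism.
NEGATIONS = {"not", "never", "cannot", "dont", "don't", "isnt", "isn't"}
INTENSIFIERS = {"very", "really", "absolutely", "extremely"}
WEAKENERS = {"slightly", "barely", "hardly", "somewhat"}
LEXICON = {"good": 1.0, "bad": -1.0, "great": 2.0}  # toy sentiment lexicon

def score(tweet):
    """Score a tweet, shifting each lexicon hit by nearby modifiers."""
    total = 0.0
    tokens = tweet.lower().split()
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        s = LEXICON[tok]
        window = tokens[max(0, i - 3):i]  # look back up to 3 words
        if any(w in INTENSIFIERS for w in window):
            s *= 1.3  # intensifying shifter strengthens the score
        if any(w in WEAKENERS for w in window):
            s *= 0.7  # weakening shifter dampens the score
        if any(w in NEGATIONS for w in window):
            s = -s    # negation flips the polarity
        total += s
    return total

print(score("this is very good"))  # 1.3
print(score("this is not good"))   # -1.0
```

Contrasting conjunctions would be handled one level up, by splitting the sentence at the conjunction and weighting the clause after it more heavily.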