SENTIMENT ANALYSIS OF TWITTER DATA
By
Bo Yuan
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
MASTER OF SCIENCE
Major Subject: COMPUTER SCIENCE
Examining Committee:
Boleslaw K. Szymanski, Thesis Adviser
Sibel Adali, Member
Malik Magdon-Ismail, Member
Rensselaer Polytechnic Institute
Troy, New York
March 2016 (For Graduation May 2016)
© Copyright 2016
by
Bo Yuan
All Rights Reserved
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Sentiment Component . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Levels of Study . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sentiment Classification . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . 7
2.2.1.1 Sentiment Lexicon . . . . . . . . . . . . . . . . . . . 7
2.2.1.2 Lexicon-Based Classification Algorithms . . . . . . . 9
2.2.2 Machine Learning-Based Methods . . . . . . . . . . . . . . . . 9
2.2.2.1 Supervised Learning Methods . . . . . . . . . . . . . 9
2.2.2.2 Unsupervised Learning Methods . . . . . . . . . . . 10
2.2.3 Rule-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 10
3. Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Two Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Basic Lexicon-Based Methods . . . . . . . . . . . . . . . . . . 14
3.1.3 Linguistic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3.1 Negation . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3.2 Valence Shifter . . . . . . . . . . . . . . . . . . . . . 18
3.1.3.3 Contrast . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3.4 Linguistic Inference Rule . . . . . . . . . . . . . . . . 21
3.2 Machine Learning-Based Methods . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.2 Linguistic Features . . . . . . . . . . . . . . . . . . . 24
3.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4. Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Motivation for Data Gathering . . . . . . . . . . . . . . . . . 27
4.1.2 Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 Cleaning Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Lexicon-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Rule-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Machine Learning-Based Methods . . . . . . . . . . . . . . . . . . . . 38
5.5 Evaluation Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
LITERATURE CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
APPENDICES
A. Linguistic Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
LIST OF TABLES
3.1 MPQA Example Entries . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Sample SentiWordNet Entries . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Example VADER Sentiment Lexicon . . . . . . . . . . . . . . . . . . . 16
3.4 N-Gram Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Positive and Negative Emoticons . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Topic Key Word(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1 Average Performance of Baseline Algorithms . . . . . . . . . . . . . . . 33
5.2 Best Performance of Lexicon-Based Methods Across Domains . . . . . . 34
5.3 Average Performance of Lexicon-Based Methods Across Domains . . . . 36
A.1 Valence Shifter Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 52
LIST OF FIGURES
3.1 Sample Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Results of Baseline Algorithms . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Results of Lexicon-Based Algorithms . . . . . . . . . . . . . . . . . . . 35
5.3 Results of Rule-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 37
5.4 Comparison of Best Performance with LIR Algorithm . . . . . . . . . . 37
5.5 Comparison of Average Performance with LIR Algorithm . . . . . . . . 38
5.6 Naive Bayes with N-Gram Bag-of-Words Features . . . . . . . . . . . . 39
5.7 Maximum Entropy with N-Gram Bag-of-Words Features . . . . . . . . 40
5.8 Support Vector Machines with N-Gram Bag-of-Words Features . . . . . 41
5.9 Average Performance of N-Gram Bag-of-Words Features . . . . . . . . . 42
5.10 Average Performance of Machine Learning Classifiers with Linguistic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.11 Comparison of Linguistic and Bag-of-Words Features . . . . . . . . . . 43
ACKNOWLEDGMENT
I would like to express my gratitude to my adviser, Professor Boleslaw Szymanski for
his generous support and kind help during my graduate study at RPI. I would also
like to thank Professor Adali and Professor Magdon-Ismail for graciously serving on
my thesis committee.
I would offer a special “Thank you” for RPI, my dearest alma mater. Life at
Rensselaer has been so wonderful that I would never forget. Thank you, RPI for
helping me find my “inner engineer”.
ABSTRACT
Sentiment Analysis and Opinion Mining has become a research hot-spot with the
rapid development of social network websites. Twitter is a typical social network
application with millions of users expressing their sentiment every day. In this work,
we explored comprehensively the methodologies applied in sentiment classification
over Twitter data: lexicon-based, rule-based and machine learning-based methods.
Our data-set was crawled and manually cleaned following the principle of Naturally
Annotated Big Data. The data-set contains 20,000 tweets ranging over ten popular
topics.
For lexicon-based methods, we experimented with the Simple Word Count
approach and the Feature Scoring approach using the most popular sentiment lexicons
and semantic resources, namely the MPQA subjectivity lexicon, SentiWordNet, the
VADER sentiment lexicon, Bing Liu's lexicon and the General Inquirer. We built
customized sentiment lexicons, designed feature scores and compared ten classifiers
on real-world Twitter data. Further, we designed Linguistic Inference Rules (LIR)
to improve lexicon-based classifiers. LIR aims to handle negation, valence shifts and
contrast conjunctions in natural language. For machine learning-based methods, we used
state-of-the-art supervised learning models: Naive Bayes, Maximum Entropy and
Support Vector Machines. Two sets of features are compared: Bag-of-Words with
N-Grams, and Part-of-Speech linguistic annotations.
1. Introduction
Sentiment and opinion are essential features of human existence. “What do we
think” and “how do we feel” play a vital role in our daily life. The decisions we
make are closely related to the emotion and attitude of both ourselves and others.
With the rapid development of Web 2.0, an increasing number of people are
expressing their opinions on-line. E-commerce websites are typical examples. Amazon
encourages customers to create reviews and provide feedback about the products
and services they purchase. By rating the products on a 5-star scale and writing
several paragraphs of review, the Amazon shoppers are able to share information
on “what people like or do not like”.
Social network website is another example where user-generated opinionated
data abounds. Social network websites usually contain a great scope of topics,
especially those related to big news events. Twitter, for example, is one of the most
popular social network websites to which people turn when big events occur. In 2010,
after a catastrophic 7.0 magnitude earthquake hit Haiti, Twitter served as a major
hub of information. Twitter was proven to be an important tool for fund-raising
and relief efforts [1]. Twitter has even changed the outcome of many historical
events, especially in political elections where millions of voters tweet frequently to
openly express their political approval or contempt. In the 2008 presidential election,
Twitter was integrated into President Obama’s campaign, which later proved to be
a huge success, inspiring numerous academic studies [2].
Sentiment analysis research goes hand in hand with the Internet boom. On the
one hand, applications of sentiment analysis provide significant commercial value.
On the other, sentiment analysis systems provide a basis for academic research in
computer science, linguistics, social science, management science, etc.
In this research, we will focus on sentiment classification of Twitter data. The
remainder of this thesis is structured as follows. Chapter 2 surveys the field of
study on definition, sub-tasks and methodologies. Chapter 3 illustrates our proposed
methods. In Chapter 4, experiment settings are described. Results of experiments
are discussed in Chapter 5. Finally, Chapter 6 summarizes our contributions and
points out future research directions.
2. Related Work
2.1 Definition
According to the Merriam-Webster dictionary [3], the word sentiment has three
layers of meanings:
• Predilection or opinion.
• Emotion or refined feeling.
• Idea colored by emotion.
By definition, all the automatic analysis of properties of such kind falls into the
range of sentiment analysis. According to Liu [4], sentiment analysis is the field of
study that analyzes peoples opinions, sentiments, evaluations, appraisal, attitudes,
and emotions towards entities such as products, services, organizations, individuals,
issues, events, topics and their attributes. The term can be used interchangeably
with Opinion Mining.
Farzindar [5] separates sentiment analysis and emotion analysis to emphasize
the subtle difference. Emotion analysis is more meticulously classified into finer
granularity. In [6], emotion is categorized into six classes: anger, disgust, fear, joy,
sadness and surprise, which are the most widely used in the literature. There is currently
no consensus on how many classes of emotions should be used. Emotion analysis is
also referred to as mood detection.
The distinction between sentiment analysis and emotion analysis is beyond the scope
of this work. In our research, all semantic orientations, per the three layers of
meaning above, expressed towards certain entities are counted as sentiment.
2.1.1 Sentiment Component
Sentiment can be divided into different components: holder, target, aspect
and polarity. Each component corresponds to specific tasks in a system.
Holder denotes the entity that holds the sentiment.
Target identifies the entity selected as the aim of the sentiment.
Polarity is the property of the sentiment. Polarity can be two-fold (positive and
negative) or three-fold (positive, negative and neutral).
Aspect defines the particular part or feature of the target that the sentiment is
expressed towards.
Let us take the following sentence as an example1:
Steve Jobs said that Microsoft simply has no taste.
The sentiment in this sentence can be analyzed as the holder (“Steve Jobs”)
expressed opinions towards the target ("Microsoft") and the polarity of the sentiment
is negative (“has no taste”).
Aspect is also an important sentiment component. Let us take the following
company review text as an example2:
Demandware as a company has a positive, people-centric, forward
thinking culture. The benefits and work life balance are great. But
cross-functional communication can be challenging.
The user has given an overall evaluation of Demandware with respect to four
aspects: culture, benefits, work-life balance and communication. While the first
three aspects receive positive evaluation, the last one receives negative evaluation
as described below:
Culture positive ("positive, people-centric, forward thinking").
Benefits positive ("great").
Work-life balance positive ("great").
Communication negative ("challenging").
1 http://www.computerworld.com/article/2471632/ (Date Last Accessed, March 1, 2016)
2 https://www.glassdoor.com/ (Date Last Accessed, March 1, 2016)
In summary, holder, target, polarity and aspect are four major components
of sentiment. They work together to convey sentiment expressed in natural lan-
guage. All of the components have attracted extensive studies in sentiment analysis
research.
2.1.2 Levels of Study
Sentiment analysis can be categorized according to the granularity of text.
Previous work mainly focuses on three levels:
• Document/text level
The analysis at this level is to determine whether the sentiment expressed in a
whole document is positive or negative. For example [7], given product reviews,
the system would be able to evaluate the overall sentiment polarity.
Document level analysis assumes a piece of text expresses sentiment towards
a single target. While this is usually true for product reviews, movie reviews,
restaurant reviews etc., it probably does not apply to situations where a document
criticizes multiple targets.
• Sentence level
The analysis at this level is to determine whether the opinions expressed in
a sentence are positive, negative or neutral. Sentence level analysis can be
conducted in two ways. One way is to simply regard the analysis as a 3-way
classification task, where the labels are positive, negative and neutral. The
second way is to first detect subjectivity in each sentence to split opinionated
texts from un-opinionated texts, then classify the subjective texts with
one of two labels (positive or negative).
The challenge of sentence level analysis is that each individual sentence is
semantically and syntactically connected with other parts of the text. There-
fore, this task requires both local and global contextual information. Yang [8]
analyzes product reviews at the sentence level and addresses this challenge successfully.
• Aspect/feature/entity level
Unlike document or sentence level analysis, aspect level analysis explores what
the holder likes or hates about the target. The tasks of such fine-grained
analysis are three-fold [9]: (1) extracting features of the target, (2) determining
feature-wise polarity, (3) summarizing the overall evaluation. Aspect level
sentiment analysis is one of the most challenging tasks compared to other levels
of analysis.
Besides, research can also be conducted at the phrase level [10], clause level
[11] or word level [12]. Some work also dives into comparative opinions [13], where
more than one target is compared, unlike regular opinions where only a single
target in each text is evaluated.
Twitter sentiment analysis falls into document level. However, since Twitter
allows a maximum of 140 characters3, each tweet status tends to be very short.
Usually a tweet contains only one simple sentence or just several words. Therefore,
Twitter sentiment analysis also calls for a wide variety of strategies utilized on other
levels of analysis.
2.1.3 Tasks
Major sentiment analysis tasks are defined by the sentiment components they
concern. With the help of modern technology, research has been widely conducted,
ranging over holder/target detection, sentiment classification, aspect extraction,
opinion spam detection etc. In our work, we will focus on document-level sentiment
classification. Specifically, given a tweet post, we will look into different methods of
assigning a polarity label.
2.2 Sentiment Classification
In this section, we will survey popular resources and methodologies used in
sentiment classification. By default, the task refers to document-level sentiment
3 https://dev.twitter.com/overview/api/counting-characters (Date Last Accessed, March 29, 2016)
classification where a whole document is regarded as an information unit. An as-
sumption made by researchers in this field is that the whole document under study
contains consistent sentiment polarity towards a single entity by a single holder.
Many types of reviews are a great example where the assumption holds true. For
tweet data, it is also true because tweets are usually short. It is not natural for a user
to include complicated information in a single tweet. The methods can be generally
categorized into three classes: lexicon-based, machine learning-based and rule-based
methods.
2.2.1 Lexicon-Based Methods
2.2.1.1 Sentiment Lexicon
A sentiment lexicon is a list of words or phrases that convey positive
or negative polarity information. The lexicon is a very important resource in sentiment
analysis: it provides sentiment information about the smallest linguistic units. Even
machine learning-based methods can rely on a sentiment lexicon in feature engineering.
Proper use of a well-designed lexicon will improve the performance of a sentiment
analysis system. In this part, we will introduce the most popular lexicons used both in
industry and academia. An overview of methods to compile customized sentiment
lexicons is also provided.
The MPQA subjectivity lexicon [10] is part of the MPQA Opinion Corpus4. The lexicon
is made available under the terms of the GNU License. Each entry represents a word and
its length, strength, Part-of-Speech and polarity. It provides a very comprehensive
amount of information, which has implications for various fields of study.
SentiWordNet [14] adds real-valued sentiment scores to each synset of WordNet
to denote its sentiment polarity (positive, negative and objective). Besides, Part-of-
Speech and context information are also incorporated. One advantage of SentiWordNet
is that it uses a semantic resource to enhance the structure of the lexicon. Another
advantage is that it assigns both positive and negative scores to a single word.
General Inquirer5 [15] is an approach to computer-assisted text analysis. It
annotates each word as either positive or negative, together with a whole series of
4 http://mpqa.cs.pitt.edu/ (Date Last Accessed, March 29, 2016)
5 http://www.wjh.harvard.edu/~inquirer/ (Date Last Accessed, March 29, 2016)
very rich linguistic, semantic, syntactic and pragmatic information.
VADER Sentiment Lexicon6 [16] is a comprehensive list of “gold-standard”
sentiment words especially applicable to micro-blog and other social network text
data. Providing both polarity and intensity, VADER is validated by human experts.
Besides common dictionary words, it also gives information on emoticons,
slang ("nah", "meh" etc.) and acronyms ("LOL", "LMAO" etc.).
Bing Liu's lexicon7 [9] is one of the most popular sentiment lexicons for the English
language. It contains 2006 positive words and 4783 negative words. The lexicon
excels at practical tasks because it contains misspellings, slang and web-language
variants of entries.
Aside from the lexicons mentioned above, researchers tend to build
customized lexicons and tailor them to their needs. Two types of approaches
are known: dictionary-based and corpus-based.
Dictionary-based approaches make use of lexical databases like WordNet to
expand a manually created seed set. The automatic expansion explores pairwise
word relations and generates a lexicon of proper size. The first work of such
propagation is [9]. An extension of this method is [17], where the results
of propagation are further pruned and sentiment strength is assigned to each word
using probabilistic methods.
Although dictionary-based approaches can generate a large number of sentiment
words, those words are usually context- and domain-independent. Corpus-based
approaches can usually solve this kind of problem. The first such work is [18], where
linguistic connectives are utilized to determine the polarity of adjectives. The
foundation of this work is the "sentiment consistency" of natural language: people
tend to use "AND" to combine words with similar semantic orientation, e.g. "beautiful
and smart" is a legitimate English phrase while "beautiful and disgusting" is
not likely to be used in real-world language. An extension of this method is [19],
in which the author explores inter-sentential and intra-sentential sentiment
consistency. This study proved useful in generating domain-dependent sentiment
6 https://github.com/cjhutto/vaderSentiment (Date Last Accessed, March 29, 2016)
7 https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html (Date Last Accessed, March 29, 2016)
words.
In our work, we built on top of popular existing lexicons and proposed cus-
tomized scoring functions.
2.2.1.2 Lexicon-Based Classification Algorithms
The motivation behind lexicon-based classification algorithms is that the senti-
ment of a document is determined by the dominant components (words or phrases).
The basic schemes include majority voting, document scoring with thresholding and
simple word counting [20].
Lexicon-based methods usually provide a baseline for further study. Recently
there has been a trend of using ensemble learning with multiple weak lexicon-based
classifiers. Augustyniak et al. [21] use a variety of lexicon-based weak classifiers
and a C4.5 decision tree as the strong classifier. The lexicon extraction method is called
Frequentiment [20] and it proved to be 3 to 5 times faster than supervised learning.
While this is very informative and promising, no similar known work has been
conducted to test its effectiveness on English-language text.
In our work, we apply two approaches to Twitter data: Simple Word
Count and Feature Scoring. A detailed description is given in the next
chapter.
2.2.2 Machine Learning-Based Methods
Sentiment classification, by its nature, is a type of two-way text categorization
task. Text categorization usually classifies data into several pre-defined categories.
It is a well-studied field with very mature solutions and applications. The majority of
research in both text categorization and sentiment analysis falls into the machine
learning-based methodology. In this section, we will briefly overview both supervised and
unsupervised methods.
2.2.2.1 Supervised Learning Methods
Model The first work using machine learning for sentiment analysis is [22]. The
models experimented with in this work have since been widely used, namely Naive Bayes
[23], Maximum Entropy and Support Vector Machines [24, 25, 26]. Pang [27]
proposed a minimum cuts algorithm to incorporate cross-sentence constraints
and improve efficiency. Li [28] built a framework based on Conditional Random
Fields (CRFs) which is capable of employing joint features for review sentences.
Feature Ever since Pang [22], algorithms and features have been actively de-
veloped and applied in sentiment analysis. Those features include uni-gram and
n-gram term frequency, sentiment words, rules, word position, length measures
etc. [4]. Among all features, rich linguistic features have been used, such as Part-of-
Speech [24], syntactic structures [28], valence shifters [26], semantic relations [29]
etc.
2.2.2.2 Unsupervised Learning Methods
Using the dominance of sentiment words for sentiment classification starts with
Turney [7]. To determine the sentiment polarity of a document, the algorithm takes
the following steps:
1. Extract phrases using a manually-created template list.
2. Estimate the sentiment orientation of the extracted phrases using pointwise
mutual information (PMI), approximated with the assistance of a search engine.
3. Compute the sentiment orientation of a whole document and determine the
polarity with a threshold.
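Step 2 is, in Turney's formulation, the difference PMI(phrase, "excellent") − PMI(phrase, "poor"), estimated from search-engine hit counts. A minimal sketch, with the hit counts passed in as hypothetical pre-fetched values:

```python
import math

def so_pmi(hits_phrase_near_excellent, hits_phrase_near_poor,
           hits_excellent, hits_poor):
    """Semantic orientation of a phrase as the log-odds of co-occurring
    with "excellent" versus "poor"; the phrase's own hit count and the
    corpus size cancel out of the PMI difference."""
    return math.log2(
        (hits_phrase_near_excellent * hits_poor)
        / (hits_phrase_near_poor * hits_excellent)
    )

# Hypothetical counts: the phrase appears near "excellent" far more
# often than near "poor", so its orientation comes out positive.
so = so_pmi(950, 50, 10000, 10000)
assert so > 0
```

In step 3 these per-phrase orientations are averaged over the document and compared against a threshold.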
2.2.3 Rule-Based Methods
The first automatic text categorization systems relied heavily on knowledge
engineering techniques [30], where a set of human-created logical rules would be
applied. Building such an expert system is usually labor-intensive, time-consuming
and expensive.
The study of sentiment analysis emerged after text categorization became a
nearly "solved problem". Therefore most research pursues a machine learning-based
methodology. There are very few pure rule-based methods or systems that we know
of. Most rules are incorporated into lexicon-based systems to improve performance.
VADER [16] is a rule-based model with rich lexical features. It aims at sentiment
analysis in micro-blog data and achieves effective and generalizable results compared
to other state-of-the-art methods.
In our work, we have also incorporated simple linguistic rules which address
issues that lexicon-based classifiers fail to handle successfully.
It is a convention for sentiment analysis researchers to categorize methods as
“lexicon-based” and “machine learning-based”. Conceptually most of the lexicon-
based methods can be regarded as “unsupervised” or “semi-supervised” learning
methods. Taboada [31] is the most comprehensive work which uses sentiment lexicon
and incorporates intensification and negation to achieve consistent across-domain
performance.
In this work, we first focused on supervised learning methods and lexicon-based
methods. We explore the most successful models and features that have
been proven effective in the literature. We also cover linguistic rules and
features to see how they can help in this context.
3. Proposed Methods
3.1 Lexicon-Based Methods
3.1.1 Two Approaches
The basic assumption of lexicon-based methods is that the sentiment of a
document is determined by its dominant sentiment words. For simplicity, "word"
in our work may refer to either a uni-gram word or a phrase. There
are two approaches to calculating such "dominance".
Simple Word Count (SWC) Given a sentiment lexicon l and a document d =
{w_1, w_2, ..., w_n}, where w_i (1 ≤ i ≤ n) represents the i-th word in the document,
let pos(l, d) denote the number of occurrences of positive words in d and neg(l, d)
the number of occurrences of negative words in d. The overall sentiment word sum
of the document, sum(l, d), is calculated as:

sum(l, d) = pos(l, d) - neg(l, d). (3.1)
The sentiment orientation of d (1 denoting "positive" and -1 denoting "negative")
can be defined as:

s_{SWC}(l, d) =
\begin{cases}
1, & sum(l, d) > 0, \\
-1, & sum(l, d) < 0, \\
RC(d), & \text{otherwise}.
\end{cases}
(3.2)

To fit our problem, we assign a random label RC(d) to d when the sentiment word sum
is 0.
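Equations 3.1 and 3.2 amount to only a few lines of code. A minimal sketch, assuming the lexicon is a hypothetical word → ±1 mapping and RC(d) is a uniform random choice:

```python
import random

def swc_classify(lexicon, document, rng=random):
    """Simple Word Count: sum +1/-1 lexicon polarities over the
    tokens of `document` (Eq. 3.1) and take the sign (Eq. 3.2)."""
    total = sum(lexicon.get(w, 0) for w in document)  # pos - neg
    if total > 0:
        return 1
    if total < 0:
        return -1
    return rng.choice([1, -1])  # RC(d): random label on a tie

# Toy lexicon for illustration only.
lex = {"good": 1, "great": 1, "bad": -1}
label = swc_classify(lex, ["a", "good", "great", "bad", "day"])
assert label == 1  # two positive words outweigh one negative word
```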
Feature Scoring (FS) Given a sentiment lexicon l, the scoring function of a
feature f maps the feature to a real-valued number where:

score(l, f)
\begin{cases}
> 0, & \text{if positive}, \\
= 0, & \text{if neutral}, \\
< 0, & \text{if negative}.
\end{cases}
(3.3)
The scoring function not only defines the polarity (“positive” or “negative”),
but it also depicts the degree of sentiment polarity. This is based on the intuition
that sentiment features have degrees. Suppose we extract sentiment words as
features: words like "good", "great", "awesome" etc. can denote different levels of
positiveness, and words like "bad", "awful", "horrible" etc. can denote different
levels of negativeness.
Given a document d = {f_1, f_2, ..., f_n}, where f_i (1 ≤ i ≤ n) represents the i-th
feature in d, the overall sentiment sum of d can be calculated as:

sum(l, d) = \sum_{i=1}^{n} score(l, f_i). (3.4)
By selecting a threshold \delta ≈ 0, the sentiment orientation of d can be defined as:

s_{FS}(l, d) =
\begin{cases}
1, & sum(l, d) > \delta, \\
RC(d), & -\delta \le sum(l, d) \le \delta, \\
-1, & sum(l, d) < -\delta.
\end{cases}
(3.5)

To fit our problem, we assign a random label to d when the sum falls inside the
threshold interval.
Simple Word Count is a special case of Feature Scoring where:

• word is extracted as a feature,

• each positive word is scored 1.0,

• each negative word is scored -1.0,

• \delta is selected as 0.0.
For Simple Word Count method, the key is to create a lexicon with polarity
attached to each word entry. For Feature Scoring method, the key is to extract
features, define an effective scoring function and find an accurate threshold δ.
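The Feature Scoring approach (Equations 3.4 and 3.5) can be sketched the same way; the feature weights below are illustrative, not drawn from any particular lexicon. Scoring word features ±1.0 with δ = 0 reduces this to Simple Word Count:

```python
import random

def fs_classify(score, features, delta=0.0, rng=random):
    """Feature Scoring: sum real-valued feature scores (Eq. 3.4);
    sums inside the band [-delta, +delta] get a random label (Eq. 3.5)."""
    total = sum(score(f) for f in features)
    if total > delta:
        return 1
    if total < -delta:
        return -1
    return rng.choice([1, -1])  # random label inside the band

# Illustrative graded weights: "awesome" is more positive than "good".
weights = {"good": 0.5, "awesome": 0.9, "bad": -0.6}
label = fs_classify(lambda f: weights.get(f, 0.0),
                    ["awesome", "bad", "film"], delta=0.1)
assert label == 1  # 0.9 - 0.6 = 0.3 exceeds the threshold
```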
3.1.2 Basic Lexicon-Based Methods
MPQA Subjectivity Lexicon (MPQA) Here are two example MPQA entries,
for the words "abandoned" and "impassive".

Table 3.1: MPQA Example Entries

Word        Annotation
abandoned   type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
impassive   type=weaksubj len=1 word1=impassive pos1=adj stemmed1=n polarity=negative priorpolarity=weakneg
As depicted in Table 3.1, the MPQA lexicon annotates words with their type,
length, string, Part-of-Speech and other features. Among those features, we only
consider "priorpolarity" and "polarity". The former denotes the word's
context-independent polarity, while the latter denotes the word's sentiment
orientation in context.
We compiled a polarized lexicon from the MPQA lexicon. The polarity of a word
is defined by its in-context polarity when present; otherwise it is defined by its prior polarity.
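This compilation step can be sketched as follows, assuming each entry is a line of space-separated key=value pairs as in Table 3.1 (the "weakneg"/"weakpos" values are modeled on the samples shown there):

```python
def mpqa_polarity(entry_line):
    """Map one MPQA-style entry to +1/-1/0, preferring the in-context
    "polarity" field and falling back to "priorpolarity"."""
    fields = dict(kv.split("=", 1) for kv in entry_line.split() if "=" in kv)
    tag = fields.get("polarity", fields.get("priorpolarity", ""))
    if tag in ("positive", "weakpos"):
        return 1
    if tag in ("negative", "weakneg"):
        return -1
    return 0  # neutral or unrecognized

# The "abandoned" entry from Table 3.1.
entry = ("type=weaksubj len=1 word1=abandoned "
         "pos1=adj stemmed1=n priorpolarity=negative")
assert mpqa_polarity(entry) == -1
```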
General Inquirer (GI) For each entry, the General Inquirer annotates at most 186
properties, which makes it a perfect tool for rich linguistic feature extraction. In
our task, we built a lexicon using the "Positive" and "Negative" properties.
Bing Liu’s Lexicon Bing Liu’s lexicon has already categorized words into “pos-
itive” and “negative” classes. We directly copied the lexicon with small amount of
encoding conversion.
SentiWordNet (SWN) SentiWordNet provides users with clusters of synony-
mous words ready to be used in sentiment analysis tasks. Sample entries can be
found in Table 3.2.
Table 3.2: Sample SentiWordNet Entries
POS ID PosScore NegScore SynsetTerms Glossa 00019131 0.625 0 accessible#1 capable of being
reached; “a townaccessible by rail”
a 00019731 0.125 0.125 ready to hand#1handy#1
easy to reach; “founda handy spot for thecan opener”
n 15247410 0 0 ephemera#1 something transitory;lasting a day
v 02771756 0 0 run dry#1dry out#2
become empty of wa-ter; “The river runsdry in the summer”
From Table 3.2 we can see:

• SentiWordNet provides a real-valued positive score and negative score (PosScore
and NegScore) for each entry.

• SentiWordNet contains not only uni-gram words, but also multi-word expressions
(n-grams).

• SentiWordNet clusters words with similar sentiment orientation together into
different sets. For example, "run dry" and "dry out" are in the same set.
Based on our observations, such features can help determine a word's polarity,
extract n-gram features and design scoring functions.
With the real-valued PosScore and NegScore for each entry, we can determine
a word’s polarity and sentiment degree. Given a SentiWordNet entry word w, the
polarity can be determined as follows:

pol_{swn}(w) =
\begin{cases}
1, & \text{if } PosScore(w) > NegScore(w), \\
-1, & \text{if } PosScore(w) < NegScore(w), \\
0, & \text{otherwise}.
\end{cases}
(3.6)
We excluded words with a polarity of 0 from our lexicon.
We designed a simple scoring function that directly uses the scores
provided by SentiWordNet. Given a word w, the scoring function is as follows:

score_{swn}(w, l) = PosScore(w) - NegScore(w). (3.7)
Besides, we can use SentiWordNet for n-gram feature extraction. In this way,
not only can we handle sentiment words, we can also address phrases, which are
essential in expressing opinions.
Vader Sentiment Lexicon (VSL) VADER is a lexicon with both polarity and intensity information attached to each entry. Its basic structure is shown in Table 3.3.
Table 3.3: Example VADER Sentiment Lexicon

Entry       Intensity  Std.     Human Evaluation Vector
accomplish   1.8       0.6      [1, 2, 3, 2, 2, 2, 1, 1, 2, 2]
dangers     -2.2       0.87178  [-1, -1, -2, -4, -2, -3, -3, -2, -2, -2]
lmao         2.0       1.18322  [3, 0, 3, 0, 3, 1, 3, 2, 3, 2]
=]           1.6       0.8      [2, 1, 3, 1, 1, 1, 2, 3, 1, 1]
The intensity of each entry is calculated by averaging the human evaluation vector gathered from ten expert annotators. The lexicon only retains entries with a standard deviation of less than 2.5.
Based on the information we observed, the polarity of each word entry w can be determined as:

polvader(w) =
    1,  if Intensity(w) > 0,
    −1, if Intensity(w) < 0.
(3.8)
The words with an intensity of 0 have already been removed by the authors.
VADER provides uni-gram word entries as features. For use in a feature scoring algorithm, two scoring functions can be designed based on the VADER lexicon. The first uses the intensity directly as the sentiment score:

scorevader0(w, l) = Intensity(w). (3.9)

The other scoring function uses a normalized intensity:

scorevader1(w, l) = Intensity(w) / d, (3.10)

where d is the range of the intensity column of the lexicon. Given W = {w1, w2, ..., wi, ..., wn}, where 1 ≤ i ≤ n and n is the size of the VADER lexicon, d can be calculated as:

d = MAX(Intensity(W)) − MIN(Intensity(W)). (3.11)
In this way, VADER is ready to be used in lexicon-based sentiment classification tasks.
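The VADER-based scoring of equations (3.8)–(3.11) can be sketched as follows, assuming a VADER-style lexicon that maps each entry to a real-valued intensity; the dictionary below is a toy excerpt of Table 3.3, not the full lexicon.

```python
# Toy excerpt of Table 3.3: entry -> intensity (assumed layout).
intensity = {"accomplish": 1.8, "dangers": -2.2, "lmao": 2.0, "=]": 1.6}

def pol_vader(word):
    """Polarity per equation (3.8); zero-intensity entries are absent."""
    return 1 if intensity[word] > 0 else -1

# Range d of the intensity column, equation (3.11).
d = max(intensity.values()) - min(intensity.values())

def score_vader1(word):
    """Normalized score per equation (3.10)."""
    return intensity[word] / d

print(pol_vader("dangers"))  # -1
print(round(d, 1))           # 4.2
```

On the full lexicon, d would be computed once over the whole intensity column rather than over this toy excerpt.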
3.1.3 Linguistic Rules
Lexicon-based classification is a simple yet useful idea, but it fails to cover many language phenomena. To better scale lexicon-based algorithms to real-world text data, additional strategies need to be devised. A detailed linguistic and logical analysis is beyond the scope of this work. In this section, we introduce three kinds of rules, with corresponding solutions, that our experiments later proved effective.
3.1.3.1 Negation
Negation is a common device in natural language for reversing the truth value of one or several units. It is usually realized with adverbs like “not”, “never”, etc.
In the following example, “won” is supposed to be a positive sentiment word, but with “never” the overall sentiment polarity is reversed from “positive” to “negative”.
RT @coolknifeguy: Leo has never won two Oscars :(
Another example below demonstrates how the negation word “never” reverses the negativeness of “bored” and renders the whole tweet positive in sentiment.
I love @BigBang CBS Watching reruns never get bored of the big
bang theory :)
Based on our observation, we made the following assumptions:
• If a tweet contains a negation expression, the tweet entails negation.
• Negation reverses the polarity of sentiment features in the sentence.
• Negation changes the sign of the feature’s sentiment score.
To implement this rule, we first collected a set of negation expressions and their variants, as shown in Table A.1. Then we defined all non-alphanumeric, non-blank-space characters, together with end-of-file and start-of-file, as sentence delimiters. Thus we could implement the following negation inference rule:

Negation Inference Rule (NIR): For a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, we define a sliding window of size k. If wi is a negation expression, then for any sentiment word wj with |i − j| ≤ k within a sentence, the polarity and the sign of the sentiment score given any sentiment lexicon l are reversed (i.e., from “positive” to “negative” or the other way around).
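The NIR rule can be sketched in a few lines of Python; the negation set and tiny lexicon below are illustrative stand-ins for Table A.1 and the full sentiment lexicons, and sentence-delimiter handling is omitted for brevity.

```python
# Sketch of the Negation Inference Rule (NIR): sentiment scores within
# a window of k tokens around a negation expression flip sign.

NEGATIONS = {"not", "never", "no"}                  # stand-in for Table A.1
LEXICON = {"won": 1.0, "bored": -1.0, "love": 1.0}  # toy sentiment lexicon

def nir_score(tokens, k=3):
    """Sum lexicon scores, reversing any score within k tokens of a negation."""
    neg_positions = [i for i, w in enumerate(tokens) if w in NEGATIONS]
    total = 0.0
    for j, w in enumerate(tokens):
        if w not in LEXICON:
            continue
        score = LEXICON[w]
        if any(abs(i - j) <= k for i in neg_positions):
            score = -score  # NIR: reverse the sign near a negation
        total += score
    return total

print(nir_score("leo has never won two oscars".split()))  # -1.0
```

With the window k = 3, “never” reverses the positive score of “won”, matching the first example tweet above.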
3.1.3.2 Valence Shifter
A valence shifter is a device in natural language that intensifies or weakens the degree of some property of specific language units. Typical examples are “very”, “more”, “fairly”, etc.
The following example demonstrates how the valence shifter “very” intensifies the sentiment degree of “disappointing” and outweighs the positiveness of the sentiment word “loyal”. As a result, the tweet should be assigned the label “negative”.
Very disappointing how @AudiMinneapolis treats a loyal customer.
:(
From these examples, we made the following assumptions:
• If a tweet contains valence shifter expressions, the tweet entails the valence shift phenomenon.
• A valence shifter can intensify or weaken the sentiment degree of the sentiment words that follow it.
• A valence shifter does not affect the sentiment degree of words before it.
• There can be multiple valence shifter expressions, but we only use the first one and ignore the others.
To implement this rule, we first manually collected a set of valence shifter expressions, shown in Table A.1. Then we defined the same sentence delimiters as in the negation inference rule. Thus we had the following valence shifter rule:

Valence Shifter Rule (VSR): For a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, we define a sliding window of size k. If wi is the first valence shifter expression, then for any sentiment word wj with 0 ≤ j − i ≤ k within a sentence, the polarity remains the same and the sentiment score given any sentiment lexicon l is intensified to α · score(wj, l) (α ≥ 1) or weakened to β · score(wj, l) (0 ≤ β ≤ 1).
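The VSR rule can be sketched similarly; the per-word multipliers (α to intensify, β to weaken) and the tiny lexicon are illustrative choices, not values from the thesis.

```python
# Sketch of the Valence Shifter Rule (VSR): the first shifter multiplies
# scores of sentiment words in the k tokens after it. The multipliers
# (alpha >= 1 intensifies, 0 <= beta <= 1 weakens) are illustrative.

SHIFTERS = {"very": 2.0, "fairly": 0.5}          # word -> alpha or beta
LEXICON = {"disappointing": -1.0, "loyal": 1.0}  # toy sentiment lexicon

def vsr_score(tokens, k=2):
    """Apply the first valence shifter to the sentiment words after it."""
    shift_at = next((i for i, w in enumerate(tokens) if w in SHIFTERS), None)
    total = 0.0
    for j, w in enumerate(tokens):
        if w not in LEXICON:
            continue
        score = LEXICON[w]
        if shift_at is not None and 0 <= j - shift_at <= k:
            score *= SHIFTERS[tokens[shift_at]]  # VSR multiplier
        total += score
    return total

print(vsr_score("very disappointing but loyal".split()))  # -1.0
```

Here “very” doubles the negative score of “disappointing”, which then outweighs the positive “loyal” outside the window, mirroring the example tweet above.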
3.1.3.3 Contrast
Contrast is the mechanism in language that joins two or more smaller units with opposite properties into a bigger unit. The language units can be clauses, sentences, paragraphs, etc. Usually, contrast is realized with “but”, “although”, “however”, etc.
The following example tweet consists of two clauses joined by the contrasting conjunction “but”. The first clause, with sentiment words such as “impressive” and “love”, appears very positive at first glance; however, the semantic orientation is determined by the second clause (also the “main clause”).
@Sprite37 I’d rly love to play DS3 bc Bloodborne’s combat looked
so impressive, but sadly I have no PS4 :(
Another example tweet also consists of two clauses joined by the conjunction “but”. The polarity of the second clause is vague because there are no obvious sentiment words. However, with the help of the first, positive clause, we can reverse the polarity and infer that the second clause is negative. Therefore, the whole tweet should be labeled as negative.
ok marco rubio is kinda hot but he’s a republican :(
Based on our observation, we made the following assumptions:
• A tweet can consist of several clauses or sentences joined by contrasting conjunctions.
• If a tweet contains a contrasting conjunction expression, the string sequence before the conjunction is regarded as the secondary clause and the string sequence following the conjunction is regarded as the main clause.
• The overall polarity of the tweet is consistent with the main clause.
• The polarity can be determined directly from the main clause or inferred from the secondary clause by reversing its polarity.
• There can be multiple contrast conjunctions in a tweet, but we only handle the first one.
To implement this rule, we first collected a set of contrast conjunction expressions, shown in Table A.1. Then we had the following contrast inference rule:

Contrast Inference Rule (CIR): Given a tweet t = {w1, w2, ..., wi, ..., wn}, where n is the text length and 1 ≤ i ≤ n, assume that wi is the first contrast conjunction expression. Then the tweet can be divided into two clauses: c0 = {w1, w2, ..., wi−1} (the secondary clause) and c1 = {wi+1, wi+2, ..., wn} (the main clause). The sentiment polarity of t is consistent with c1 and is the reverse of c0.
In our implementation, we prioritized the main clause over the secondary clause. The algorithm is shown as follows:

Data: Tweet t with a contrast conjunction, Lexicon l
Result: Polarity label for the given tweet t
[c0, c1] ← split t by the contrast conjunction;
if score(c1, l) ≠ 0 then
    pol(t, l) ← pol(c1);   /* Use the main clause */
else if score(c0, l) ≠ 0 then
    pol(t, l) ← −pol(c0);  /* Reverse the secondary clause */
else
    pol(t, l) ← RC(t);     /* Assign a random label */
end
Algorithm 1: Contrast Inference Rule (CIR) Algorithm
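One possible Python rendering of Algorithm 1 is sketched below; the lexicon-scoring helper and conjunction list are toy stand-ins for the real lexicons and Table A.1.

```python
import random

CONJUNCTIONS = {"but", "although", "however"}  # stand-in for Table A.1
LEXICON = {"hot": 1.0, "love": 1.0, "impressive": 1.0, "sadly": -1.0}

def score(tokens):
    """Toy lexicon score: sum of per-word sentiment values."""
    return sum(LEXICON.get(w, 0.0) for w in tokens)

def sign(x):
    return (x > 0) - (x < 0)

def cir_polarity(tokens):
    """Algorithm 1: split at the first conjunction, prefer the main clause."""
    i = next((k for k, w in enumerate(tokens) if w in CONJUNCTIONS), None)
    if i is None:
        return sign(score(tokens))
    c0, c1 = tokens[:i], tokens[i + 1:]  # secondary clause, main clause
    if score(c1) != 0:
        return sign(score(c1))           # use the main clause
    if score(c0) != 0:
        return -sign(score(c0))          # reverse the secondary clause
    return random.choice([1, -1])        # assign a random label (RC)

print(cir_polarity("marco rubio is kinda hot but he is a republican".split()))  # -1
```

For the second example tweet, the main clause carries no sentiment words, so the positive secondary clause (“hot”) is reversed and the tweet is labeled negative.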
3.1.3.4 Linguistic Inference Rule
Based on our discussion in the previous parts, our linguistic rules are responsible for handling negation, valence shift and contrast in tweet text. We simplify the problem further by assuming the importance order of these rules is CIR > NIR > VSR. The analysis pipeline is shown as follows:

Data: Tweet t, Lexicon l
Result: Polarity label for t
if t entails contrast then
    classify t using the CIR rule;
else if t entails negation then
    classify t using the NIR rule;
else if t entails valence shift then
    classify t using the VSR rule;
else
    classify t using standard lexicon-based algorithms;
end
Algorithm 2: Linguistic Inference Rule (LIR) Algorithm
The LIR Algorithm simplifies the process and sidesteps some edge cases where multiple linguistic phenomena coexist in a tweet, which saves us from involved logical inference and knowledge engineering. As demonstrated by the experiments in a later chapter, these rules and algorithms helped improve the system to a certain extent.
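The priority dispatch of Algorithm 2 can be sketched as below; the trigger sets and rule stubs are hypothetical placeholders, not the thesis implementation.

```python
# Sketch of Algorithm 2 (LIR): route each tweet to the highest-priority
# applicable rule, in the order CIR > NIR > VSR.

CONTRAST = {"but", "although", "however"}
NEGATION = {"not", "never", "no"}
SHIFTERS = {"very", "fairly"}

def lir_classify(tokens, cir, nir, vsr, default):
    """Dispatch to the first rule whose trigger words appear in the tweet."""
    words = set(tokens)
    if words & CONTRAST:
        return cir(tokens)
    if words & NEGATION:
        return nir(tokens)
    if words & SHIFTERS:
        return vsr(tokens)
    return default(tokens)

# Toy usage: each rule is stubbed with a constant label.
label = lir_classify("ok but sad".split(),
                     cir=lambda t: "cir", nir=lambda t: "nir",
                     vsr=lambda t: "vsr", default=lambda t: "lex")
print(label)  # cir
```

Because only the first matching rule fires, a tweet containing both a conjunction and a negation is handled by CIR alone, which is exactly the simplification described above.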
3.2 Machine Learning-Based Methods
3.2.1 Feature
Twitter data are just sequences of string characters. To use automatic classification algorithms, a special representation must be used to make them suitable for computation. In our work, we used two types of representation: Bag-of-Words n-grams and linguistic features.
3.2.1.1 N-Grams
Bag-of-Words is one of the most successful feature representations in text categorization tasks. Under this model, text input is represented as a vector of tokens with their corresponding numeric values.
To process a tweet from raw text into a bag-of-words representation, the following steps are taken:
• Tokenize the input text from a character sequence into tokens.
• Convert the token strings to lower case.
• Remove stop words (function words like “the”, “of”, “a”, etc.) and punctuation.
• Convert tokens from strings to integer feature indexes.
• Convert feature sequences to feature vectors by a certain computation.
The computation for converting a feature sequence into a vector varies, and there are many sophisticated methods, such as presence (0 or 1), frequency (word count), IDF (Inverse Document Frequency) [32], TF-IDF [33], etc.
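The steps above can be sketched as follows; the tokenizer and stop-word list are simplified stand-ins for MALLET’s pipeline, and frequency is used as the vector value.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "you"}  # illustrative subset

def bow_vector(text, vocab):
    """Tokenize, lower-case, drop stop words, and count integer features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    vec = Counter()
    for t in tokens:
        idx = vocab.setdefault(t, len(vocab))  # map string -> feature index
        vec[idx] += 1
    return dict(vec)

vocab = {}
vec = bow_vector("No, Adele. I love you, but you're not going to make me cry today.", vocab)
print(vec[vocab["love"]])  # 1
```

Swapping the frequency count for presence, IDF or TF-IDF only changes the value stored at each feature index, not the pipeline itself.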
In the Bag-of-Words model, an n-gram refers to a slice of a longer feature sequence consisting of n contiguous tokens. N-grams were originally used in language modeling [34] by researchers interested in the probability of a word given its uses in the given documents. They are widely used in information retrieval and text mining. The most common are uni-grams, bi-grams and tri-grams, though even higher-order grams are used.
Consider the following tweet:
No, Adele. I love you, but you’re not going to make me cry today. Next!
lol
Table 3.4 shows the uni-gram, bi-gram and tri-gram feature vectors (frequency) of the tweet above, computed together with the other tweets in our data-set. For readability, we have skipped features with a value of 0. As the size of the n-gram grows, the feature space expands rapidly and each vector becomes vastly sparse.
Table 3.4: N-Gram Examples

N-gram    Feature Vector
uni-gram  love(92)=1.0, make(136)=1.0, cry(140)=1.0, adele(229)=1.0, today(362)=1.0, lol(644)=1.0
bi-gram   you re(63)=1.0, i love(218)=1.0, to make(353)=1.0, make me(354)=1.0, me cry(364)=1.0, going to(802)=1.0, adele i(976)=1.0, not going(1790)=1.0, no adele(2075)=1.0, love you(2087)=1.0, you but(7417)=1.0, but you(7418)=1.0, re not(7419)=1.0, cry today(7420)=1.0, today next(7421)=1.0, next lol(7422)=1.0
tri-gram  to make me(366)=1.0, i love you(2380)=1.0, not going to(3265)=1.0, no adele i(8957)=1.0, adele i love(8958)=1.0, love you but(8959)=1.0, you but you(8960)=1.0, but you re(8961)=1.0, you re not(8962)=1.0, re not going(8963)=1.0, going to make(8964)=1.0, make me cry(8965)=1.0, me cry today(8966)=1.0, cry today next(8967)=1.0, today next lol(8968)=1.0
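N-gram extraction itself is a simple sliding window; the feature indexes in Table 3.4 come from the whole corpus, so this sketch only lists the n-gram strings for the example tweet.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token slices, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("no adele i love you but you re not going "
          "to make me cry today next lol").split()

print(ngrams(tokens, 2)[:3])  # ['no adele', 'adele i', 'i love']
print(ngrams(tokens, 3)[0])   # no adele i
```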
In our work, we compared the effectiveness of n-grams as features for tweet sentiment classification. Our major tool for pre-processing, cleaning and computation is MALLET8 [35].
8http://mallet.cs.umass.edu/ (Date Last Accessed, March, 29 2016)
3.2.1.2 Linguistic Features
Linguistic features refer to features incorporating rich linguistic annotation, including Part-of-Speech, semantic relations, syntactic structures, etc. Such features usually rely on highly accurate taggers (and parsers). Rich linguistic features are essential for deep natural language understanding.
Given the following tweet, “my mom won’t stop calling me Justin bieber since I got my hair cut”, the Part-of-Speech tagging is shown in Figure 3.1.
Figure 3.1: Sample Part-of-Speech Tagging
For our task, we used the Stanford CoreNLP toolkit [36]. It is a highly optimized Maximum Entropy tagger with success in cross-domain natural language processing tasks. The tag set Stanford CoreNLP uses is from the Penn TreeBank9. These tags are the Parts-of-Speech of words, which denote the syntactic and semantic function of a word. For example, “NN” refers to singular nouns (“mom”, “hair”), “PRP$” refers to possessive pronouns (“my”), “VBD” refers to past-tense verbs, etc. More details on the tag set can be found on the Penn TreeBank website.
Part-of-Speech is a very important feature of natural language. It helps when two words share the same form but have totally different meanings, such as “play” (noun or verb), “book” (noun or verb), etc. With an accurate label of the linguistic role attached to a feature, the ambiguity of natural language is expected to be resolved to a great extent.
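One plausible way to turn POS tags into classifier features is to pair each token with its tag and form bi-grams over the pairs; the hand-written Penn TreeBank-style tags below are assumptions standing in for Stanford CoreNLP output.

```python
def pos_bigrams(tagged):
    """Join adjacent (word, tag) pairs into word/TAG bi-gram features."""
    units = [f"{w}/{t}" for w, t in tagged]
    return [" ".join(units[i:i + 2]) for i in range(len(units) - 1)]

# Hand-tagged prefix of the example tweet (assumed tags, not tagger output).
tagged = [("my", "PRP$"), ("mom", "NN"), ("won't", "MD"), ("stop", "VB")]
print(pos_bigrams(tagged)[0])  # my/PRP$ mom/NN
```

Because the tag travels with the word, ambiguous forms like “play” as a noun versus a verb become distinct features.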
3.2.2 Model
For our experiments, we chose three state-of-the-art models for classification tasks. Since these models are well studied, we will focus on each model’s application to sentiment classification and its configuration in our experiment. Naive Bayes (NB) is a simple classifier based on Bayes’ rule and a conditional independence assumption. It assigns the class label with the maximum conditional probability given the training set. Maximum Entropy (ME) is a highly effective classifier that iteratively searches for and optimizes feature-weight parameters to maximize the likelihood of the training set. The Support Vector Machines (SVM) model aims at finding a decision surface that maximizes the margin between two classes.
9http://www.cis.upenn.edu/~treebank/ (Date Last Accessed, March 29, 2016)
The models we chose follow the first sentiment analysis work using machine learning [22]. The toolkits we used for the machine learning implementation are MALLET (Naive Bayes and Maximum Entropy) [35] and LibSVM (Support Vector Machines) [37].
3.3 Evaluation
We used the standard metrics in text categorization to evaluate the various classifiers. Suppose we have a set of classification results over n classes, and let cij (0 ≤ i ≤ n − 1, 0 ≤ j ≤ n − 1) denote the number of instances where a document in the ith class is categorized as belonging to the jth class. The per-class measures can be calculated as follows:

Precision Assessment of what fraction of the instances assigned to a class are classified correctly:

p = cii / Σj cji. (3.12)

Recall Assessment of what fraction of the instances belonging to a class are correctly classified:

r = cii / Σj cij. (3.13)

F-measure Assessment combining both precision and recall. It helps researchers achieve a balance through the trade-off between precision and recall. The most common F-measure is the F1 measure:

F1 = 2(p × r) / (p + r). (3.14)

Accuracy Assessment of what fraction of instances are correctly classified across all classes:

a = Σi cii / Σj Σi cij. (3.15)
Based on these per-class measurements, we have two types of averaging: macro-average and micro-average.

Micro-average Create a contingency table for all classes, then compute the precision, recall and F1 measure of the whole data-set as one “big class”.

Macro-average Compute the precision, recall and F-measure for each class, then average the sums over the number of classes.

In our work, we used macro-averaged measurements to evaluate the performance of our algorithms.
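Equations (3.12)–(3.15) and macro-averaging can be sketched from a confusion matrix as follows; the 2×2 matrix is a toy example, not data from our experiments.

```python
def per_class(c, i):
    """Precision, recall and F1 for class i, equations (3.12)-(3.14)."""
    pred_i = sum(c[j][i] for j in range(len(c)))  # instances predicted as i
    true_i = sum(c[i])                            # instances actually in i
    p = c[i][i] / pred_i if pred_i else 0.0
    r = c[i][i] / true_i if true_i else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def accuracy(c):
    """Equation (3.15): correct instances over all instances."""
    return sum(c[i][i] for i in range(len(c))) / sum(map(sum, c))

def macro_average(c):
    """Average per-class precision, recall and F1 over all classes."""
    stats = [per_class(c, i) for i in range(len(c))]
    return tuple(sum(s[k] for s in stats) / len(stats) for k in range(3))

# Toy 2x2 confusion matrix: rows are true classes, columns predictions.
c = [[8, 2], [4, 6]]
print(accuracy(c))  # 0.7
```

A micro-average would instead pool all classes into one contingency table before computing the same measures.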
4. Experiment
4.1 Data-set
The data-set we created for our experiment was collected from newly posted twitter statuses (tweets) in February 2016. The web crawler is Twitter4J10, a third-party Java tool for the Twitter API.
4.1.1 Motivation for Data Gathering
Normally, a data-set for sentiment analysis is manually annotated by domain experts, researchers and linguists. However, hand-labeled data tend to be expensive and time-consuming to produce. To gather twitter data, we combined Naturally Annotated Big Data with manual cleaning. Before a detailed description, we present our motivation here.
Naturally Annotated Big Data(NADB) [38] refers to the data generated from
“natural user behavior”. For example, a user of TripAdvisor website might post a
status saying:
I like Tokyo, Beijing, Shanghai and other cities.
By analyzing this sentence, it is very easy for computer programs to extract a certain “is-kind-of” relation: Beijing, Shanghai and Tokyo are cities. Such phenomena are ubiquitous in web-pages, blogs, tweets and other kinds of textual data.
The natural annotation we used in our data gathering is the emoticon11. Emoticons are tokens representing facial expressions using punctuation marks and alphanumeric characters. Our assumption is that users use “happy” emoticons to express positive sentiment and “sad” emoticons to express negative sentiment. Only in very rare cases would the opposite happen.
10http://twitter4j.org/en/index.html (Date Last Accessed, March 21, 2016)11The full list of Twitter emoticons can be found at: http://emojipedia.org/twitter/ (Date Last
Accessed, March 10, 2016)
Another assumption is that if a user mentions a key word in a tweet, the tweet is about the topic represented by that keyword. We chose 10 topic lists containing popular entity words/phrases in the hope that these topic key words can help gather topic-specific data via the query function provided by the Twitter API.
4.1.2 Raw Data
We first selected a list of several “positive” emoticons and “negative” emoticons, as shown in Table 4.1. The meanings of these emoticons are deterministic, with the least possible ambiguity.
Table 4.1: Positive and Negative Emoticons

Polarity  Emoticons
Positive  :), ;), :D, :-), :-D
Negative  :(, :-(, :'(, :'-(, D:
Second, we collected nine lists of key words pertaining to nine topics. These key words are names of entities (celebrities, commercial brands, titles of a movie or a TV show, etc.) that have been widely discussed either by mass media or by social network users. A detailed description of the topics, with examples, is provided in Table 4.2.
After the emoticon list and topic lists were prepared, we were ready to crawl topic-related tweets using the Twitter API. For each key word, a query of “key word + emoticon” returns a collection of results belonging to the specific topic with a certain sentiment polarity. For example, a query of “Taylor Swift” with “:)” returns a collection of tweets under the topic “artist” with polarity “positive”, while a query of “AngularJS” with “:(” returns a collection of tweets under the topic “technology” with polarity “negative”. We iteratively went through all nine topic lists and crawled a positive tweet set and a negative tweet set for each topic.
To conclude the data-crawling process, we collected another two tweet sets using only positive or negative emoticon string literals as queries. In this way, we collected data for a general topic with no specific domain.
After these steps, we had successfully built a raw data pool with positive and negative twitter data for ten topics (including one general-domain topic).
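The “key word + emoticon” queries can be generated mechanically; the thesis issues them through Twitter4J (Java), while this Python stand-in only constructs the query strings and their implied labels.

```python
POSITIVE = [":)", ";)", ":D", ":-)", ":-D"]    # Table 4.1, positive
NEGATIVE = [":(", ":-(", ":'(", ":'-(", "D:"]  # Table 4.1, negative

def build_queries(keyword):
    """Pair a topic keyword with each emoticon and its implied polarity."""
    queries = [(f"{keyword} {e}", "positive") for e in POSITIVE]
    queries += [(f"{keyword} {e}", "negative") for e in NEGATIVE]
    return queries

print(build_queries("Taylor Swift")[0])  # ('Taylor Swift :)', 'positive')
```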
Table 4.2: Topic Key Word(s)

#  Topic       Description                                         Examples
1  Artist      Names of popular musicians or actors                Taylor Swift, Lady Gaga, Alessia Cara
2  Automobile  Brand names of popular cars                         Aston Martin, Audi, BMW, Buick
3  Game        Names of popular games on all platforms             Batman: Arkham Knight, Halo 5: Guardians
4  IT Company  Names of famous IT companies                        Oracle, SAP, Fujitsu, Accenture
5  Movie       Names of popular movies                             Mad Max: Fury Road, Jurassic World, Furious 7
6  Politician  Names of 2016 presidential candidates               Hillary Clinton, Donald Trump, Ted Cruz
7  Software    Names of popular software across all platforms      Yik Yak, Instagram, Zillow, Fitbit
8  Technology  Names of popular software engineering technologies  AngularJS, Java Spring, MeteorJS, CakePHP
9  TV Show     Names of popular TV shows on Netflix.com            Game of Thrones, Grey’s Anatomy, Vikings
4.1.3 Cleaning Data
Based on the raw data collected as described in the previous section, our data-cleaning process is as follows:
1. Remove non-English tweets.
2. Remove blank symbols (new lines, spaces, tabs, etc.).
3. Remove Unicode characters.
4. Remove tiny links12, retweet key words (“RT”) and usernames (“@username”) generated by the Twitter system.
5. Remove the emoticons used in the data-crawling stage.
6. Randomly select 1,000 positive tweets and 1,000 negative tweets for each topic.
12https://support.twitter.com/articles/78124 (Date Last Accessed, March 10, 2016)
The data-set13 consists of ten topics, each with 1,000 positive and 1,000 negative records. This finalizes our data preparation for the experiments.
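Steps 4 and 5 of the cleaning process can be approximated with a few regular expressions; the patterns below are simplified stand-ins for the actual pipeline, not the thesis code.

```python
import re

# Emoticons from Table 4.1, longest first so ":-(" is stripped before ":(".
QUERY_EMOTICONS = [":'-(", ":-)", ":-D", ":-(", ":'(", ":)", ";)", ":D", ":(", "D:"]

def clean_tweet(text):
    """Strip tiny links, the RT marker, @usernames and crawl emoticons."""
    text = re.sub(r"https?://t\.co/\S+", "", text)  # tiny links
    text = re.sub(r"\bRT\b", "", text)              # retweet key word
    text = re.sub(r"@\w+", "", text)                # usernames
    for e in QUERY_EMOTICONS:
        text = text.replace(e, "")
    return " ".join(text.split())                   # collapse whitespace

print(clean_tweet("RT @user I love it https://t.co/abc :)"))  # I love it
```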
4.2 Setting
In the experiment, we explored the three types of methods introduced in the previous chapter: lexicon-based methods, rule-based methods and machine learning-based methods. This section introduces the order of the experiments and their detailed configurations.
Baseline methods We used two methods as our general baseline against which to measure improvements of our system.
1. Random Classifier (RC) Given a tweet, a random class label from the label set is assigned.
2. Most Frequent Classifier (MFC) Given a tweet, the class label with maximum occurrence in the training corpus is assigned.
Lexicon-Based Methods The lexicon-based methods we experimented with combine our sentiment lexicons with simple word count or feature scoring approaches:
1. MPQA-SWC MPQA lexicon with Simple Word Count approach.
2. GI-SWC General Inquirer lexicon with Simple Word Count approach.
3. BL-SWC Bing Liu’s lexicon with Simple Word Count approach.
4. SWN-SWC SentiWordNet lexicon with Simple Word Count approach.
5. VSL-SWC Vader Sentiment Lexicon with Simple Word Count approach.
6. SWN-UFS SentiWordNet lexicon with Uni-gram feature scoring function in
equation 3.7.
7. SWN-BFS SentiWordNet lexicon with bi-gram feature scoring function in
equation 3.714.
13The data can be downloaded from: http://homepages.rpi.edu/~yuanb/thesis/thesis.html(Date Last Accessed, March 29, 2016)
14In n-gram, we included uni-gram to n-gram features.
8. SWN-TFS SentiWordNet lexicon with tri-gram Feature Scoring function in
equation 3.7.
9. VSL-BFS Vader Lexicon with basic feature scoring function in equation 3.9.
10. VSL-NFS Vader Sentiment Lexicon with normalized feature scoring function
in equation 3.10.
Rule-Based Methods We incorporated our Linguistic Inference Rule method from Algorithm 2 with the top three lexicon-based methods.
1. BL-SWC-LIR Bing Liu’s lexicon with Simple Word Count approach with
LIR algorithm.
2. VSL-NFS-LIR Vader Lexicon with normalized featuring scoring function
and LIR algorithm.
3. VSL-BFS-LIR Vader Lexicon with basic featuring scoring function and LIR
algorithm.
Machine Learning-Based Methods We incorporated two sets of features into three models: N-gram bag-of-words (BOW) features and deeper linguistic features. The N-gram BOW methods include:
1. NB-NGRAM Naive Bayes classifier with uni-gram to 8-gram BOW features.
2. ME-NGRAM Maximum Entropy classifier with uni-gram to 8-gram BOW
features.
3. SVM-NGRAM Support Vector Machines classifier with uni-gram to 8-gram
BOW features.
For the linguistic features, we chose bi-grams15 combined with Part-of-Speech (POS) features, in the following methods.
1. NB-POS Naive Bayes classifier with bi-gram Part-of-Speech features.
15This is because bi-gram performs best in BOW experiment stage.
2. ME-POS Maximum Entropy classifier with bi-gram Part-of-Speech features.
3. SVM-POS Support Vector Machines classifier with bi-gram Part-of-Speech
features.
5. Discussion
5.1 Baseline
The results of the two baseline algorithms are shown in Figures 5.1a and 5.1b.
(a) Random Classifier(RC) (b) Most Frequent Classifier(MFC)
Figure 5.1: Results of Baseline Algorithms
RC achieved results of around 0.5000 across the ten domains in all measurements. This matches our expectation, because the data-set is balanced across classes and domains. MFC achieved accuracy and recall comparable to RC; its recall is always 0.5000. However, its precision is very high (around 0.7500) while its F1 value is low (around 0.3500), which follows from the trade-off between the two measurements. The average performance of the two algorithms across all domains is shown in Table 5.1.
Table 5.1: Average Performance of Baseline Algorithms

Classifier  Accuracy  Precision  Recall  F1
RC          0.5021    0.5021     0.5021  0.50203
MFC         0.5008    0.7504     0.5     0.33365
5.2 Lexicon-Based Methods
The results of the ten lexicon-based methods are shown in Figure 5.2. The ten lexicon-based classifiers demonstrated uneven classification capability across the ten topics. In overall accuracy, we can expect somewhat higher performance than the baselines.
However, in terms of the F1 measurement, which is another comprehensive evaluation measurement, the results are mixed.
Table 5.2 shows the best performance for each classifier and Table 5.3 shows the average performance. For accuracy, the best results range from 0.5700 to 0.6780, higher than the 0.5250 best baseline accuracy achieved by RC. Average accuracy is generally between 0.5060 and 0.5605. More optimistic results can be expected in terms of precision, where the best performance exceeds 0.6269 and average precision is also likely to achieve 0.6045. These two measurements prove a generally adequate capability of lexicon-based classifiers to predict “correctly” a certain portion of opinionated tweets.
In terms of recall and F1 value, lexicon-based classifiers vary considerably. While recall values vacillate from 0.5038 to 0.6489, slightly higher than the baseline, the overall F1 value can be as low as 0.3895, which is worse than the baseline, or as high as 0.6714, which is satisfactory in certain cases. This reveals that although lexicon-based classifiers can generally increase “correctness”, their ability to “find all” and to “win favour on all sides” is unpredictable.
Table 5.2: Best Performance of Lexicon-Based Methods Across Domains

Algorithm  Accuracy  Macro-Precision  Macro-Recall  Macro-F1
MPQA-SWC   0.644     0.6668           0.6433        0.6307
GI-SWC     0.627     0.6269           0.6268        0.6268
BL-SWC     0.678     0.6953           0.6789        0.6714
VSL-SWC    0.625     0.6352           0.6253        0.6249
SWN-SWC    0.57      0.6535           0.5662        0.4965
SWN-UFS    0.567     0.6964           0.5916        0.513
SWN-BFS    0.588     0.6868           0.5916        0.5297
SWN-TFS    0.582     0.6735           0.5805        0.5165
VSL-BFS    0.644     0.6489           0.6442        0.644
VSL-NFS    0.647     0.647            0.6471        0.647
In terms of lexicons, Bing Liu’s sentiment lexicon (BL-SWC) and the Vader Sentiment Lexicon (VSL-BFS and VSL-NFS) give the top three most effective lexicon-based methods. The former uses the Simple Word Count approach and the latter the Feature Scoring approach. A possible explanation for their performance is that Bing Liu’s lexicon is specially compiled from Internet corpora and the Vader Sentiment Lexicon is also tailored for sentiment analysis over social network data. These lexicons require the least effort for domain adaptation and are likely to cover more occurrences of real-world features in Internet language.

(a) MPQA-SWC (b) GI-SWC
(c) BL-SWC (d) SWN-SWC
(e) VSL-SWC (f) SWN-UFS
(g) SWN-BFS (h) SWN-TFS
(i) VSL-BFS (j) VSL-NFS
Figure 5.2: Results of Lexicon-Based Algorithms

Table 5.3: Average Performance of Lexicon-Based Methods Across Domains

Algorithm  Accuracy  Macro-Precision  Macro-Recall  Macro-F1
MPQA-SWC   0.5095    0.50903          0.51032       0.48397
GI-SWC     0.5427    0.54608          0.54402       0.53706
BL-SWC     0.593     0.60451          0.59403       0.58315
VSL-SWC    0.5524    0.55426          0.552         0.54238
SWN-SWC    0.5061    0.51134          0.50383       0.3895
SWN-UFS    0.5064    0.52256          0.50792       0.39377
SWN-BFS    0.506     0.51201          0.50698       0.39186
SWN-TFS    0.5117    0.52248          0.50682       0.39473
VSL-BFS    0.5536    0.55782          0.55471       0.54536
VSL-NFS    0.5605    0.56332          0.55973       0.55079
5.3 Rule-based Methods
The results of the lexicon-based methods with the Linguistic Inference Rule (LIR) algorithm are shown in Figure 5.3. We chose the top three lexicon-based methods: BL-SWC, VSL-BFS and VSL-NFS.
As shown in Figures 5.5 and 5.4, the LIR algorithm can boost performance to a certain extent. In terms of average evaluation, VSL-BFS and VSL-NFS both show a certain degree of increase in all measurements. In terms of best performance, BL-SWC and VSL-BFS both improve in all measurements, and the precision of VSL-BFS increased by 1%. However, we also see that for BL-SWC, none of the average measurements increased with LIR, and for VSL-BFS, LIR achieved only a better precision, while accuracy, recall and F1 remained comparable with the baseline. One noteworthy result is that for both the VSL-NFS and BL-SWC methods, the best precision increased and exceeded 0.7000.
The performance of the LIR algorithm depends on both the data composition and the corresponding lexicon-based method. From our experiment, we can infer that rule-based methods can help increase precision and accuracy, but in terms of overall performance, their efficiency still needs more examination.
(a) BL-SWC-LIR (b) VSL-NFS-LIR
(c) VSL-BFS-LIR
Figure 5.3: Results of Rule-Based Methods
(a) BL-SWC with/without LIR (b) VSL-BFS with/without LIR
(c) VSL-NFS with/without LIR
Figure 5.4: Comparison of Best Performance with LIR Algorithm
(a) BL-SWC with/without LIR (b) VSL-BFS with/without LIR
(c) VSL-NFS with/without LIR
Figure 5.5: Comparison of Average Performance with LIR Algorithm
5.4 Machine Learning-Based Methods
We used three state-of-the-art classifiers, namely Naive Bayes (NB), Maximum Entropy (ME) and Support Vector Machines (SVM), together with two sets of features.
The results of the machine learning-based classifiers incorporating N-gram Bag-of-Words features, with N ranging from 1 (uni-gram) to 8, are shown by domain in Figures 5.6, 5.7 and 5.8.
Generally, the machine learning classifiers achieved very encouraging results in evaluation. All four measurements are very high compared to the lexicon-based and rule-based classifiers. Naive Bayes is one of the simplest classifiers, yet it achieved 0.8589−0.8774 in average accuracy, 0.8605−0.8798 in precision, 0.8588−0.8774 in recall and 0.8586−0.8771 in F1 value. The best performance of the NB classifier reached over 0.9500 in all measurements. Slightly higher results could be expected for Maximum Entropy: while the best performance of the ME classifier was roughly the same as NB’s, its average performance in all measurements was about 1% higher. SVM was the best classifier in overall performance. Its average measurements reached 0.8600−0.8900 in general and its best
(a) NB-ART-BOW (b) NB-AUT-BOW
(c) NB-GAM-BOW (d) NB-GEN-BOW
(e) NB-ITC-BOW (f) NB-MOV-BOW
(g) NB-POL-BOW (h) NB-SOF-BOW
(i) NB-TEC-BOW (j) NB-TVS-BOW
Figure 5.6: Naive Bayes with N-Gram Bag-of-Words Features
(a) ME-ART-BOW (b) ME-AUT-BOW
(c) ME-GAM-BOW (d) ME-GEN-BOW
(e) ME-ITC-BOW (f) ME-MOV-BOW
(g) ME-POL-BOW (h) ME-SOF-BOW
(i) ME-TEC-BOW (j) ME-TVS-BOW
Figure 5.7: Maximum Entropy with N-Gram Bag-of-Words Features
(a) SVM-ART-BOW (b) SVM-AUT-BOW
(c) SVM-GAM-BOW (d) SVM-GEN-BOW
(e) SVM-ITC-BOW (f) SVM-MOV-BOW
(g) SVM-POL-BOW (h) SVM-SOF-BOW
(i) SVM-TEC-BOW (j) SVM-TVS-BOW
Figure 5.8: Support Vector Machines with N-Gram Bag-of-Words Features
(a) NB-BOW-AVG (b) ME-BOW-AVG
(c) SVM-BOW-AVG
Figure 5.9: Average Performance of N-Gram Bag-of-Words Features
measurements all exceeded 0.9600.
Our Bag-of-Words features range from uni-gram to 8-gram. Based on our observations, most data sets reach their best performance with bi-grams, as depicted in Figures 5.6 to 5.8. For the NB classifier, 6 out of 10 topics favor bi-grams over the alternatives; for ME and SVM, the numbers are 5 and 8, respectively. In terms of average performance, bi-grams clearly dominate all other n-gram features, as depicted in Figure 5.9.
From our experiments, we conclude that uni-gram features are generally effective and that bi-grams are the most effective BOW features for multi-domain Twitter sentiment analysis. Moreover, SVM performed best in terms of all common measurements.
To experiment further with the machine learning-based classifiers, we incorporated a rich linguistic feature: Part-of-Speech (POS) tags. For simplicity, we only conducted experiments with bi-gram features. The results are shown in Figure 5.10. Compared to Bag-of-Words features alone, the results improved for all three classifiers, as depicted in Figure 5.11: across all domains, all four average measurements increased by approximately 0.0100 to 0.0200. For Twitter data, however, POS tagging can be both inefficient and error-prone, and the improvement is relatively small considering how time-consuming the tagging is.
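One common way to fold POS information into a bag-of-words model is to emit tagged tokens alongside plain words. A real system would use a tagger such as Stanford CoreNLP [36]; the toy lookup table and the `word_TAG` feature format below are assumptions made only to keep the sketch self-contained.

```python
# Sketch of POS-augmented features. TOY_TAGS stands in for a real
# POS tagger (e.g. Stanford CoreNLP), which would be slower and can
# mis-tag noisy Twitter text -- the trade-off discussed above.
TOY_TAGS = {
    "this": "DT", "movie": "NN", "is": "VB", "great": "JJ",
    "terrible": "JJ", "game": "NN", "love": "VB", "a": "DT",
}

def pos_features(tweet):
    """Emit plain word tokens followed by word_TAG tokens ('great_JJ')."""
    tokens = tweet.lower().split()
    feats = list(tokens)
    for tok in tokens:
        feats.append(f"{tok}_{TOY_TAGS.get(tok, 'UNK')}")
    return feats

print(pos_features("this movie is great"))
# -> ['this', 'movie', 'is', 'great',
#     'this_DT', 'movie_NN', 'is_VB', 'great_JJ']
```

The augmented token list can then be fed to the same vectorizer and classifiers as the plain Bag-of-Words features.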
Figure 5.10: Average Performance of Machine Learning Classifiers with Linguistic Features
(panels (a)–(c): NB, ME and SVM with POS features)
Figure 5.11: Comparison of Linguistic and Bag-of-Words Features
(panels (a)–(c): NB, ME and SVM with POS and BOW features)
5.5 Evaluation Revisited
From our discussion, it appears that machine learning-based methods far outperform lexicon-based and rule-based methods in almost all evaluation measurements. Even the simplest machine learning model can achieve a score 35% higher than a fine-tuned lexicon-based or rule-based classifier. However, a closer examination of our problem raises a new question: is a classifier with high accuracy really accurate?
For classification problems with relatively "strict" theoretical grounding and boundaries, such as text categorization, protein functional categorization and face detection, it is true that higher accuracy means a better system. However, sentiment analysis entails a large amount of subjectivity. In practice, it is hard to quantify the intensity of an emotion or opinion. For single words like "good" and "excellent", we can conclude that the former is not as "strong" as the latter. But when it comes to "terrible", "dreadful" and "horrible", the distinctions are too vague to draw easily.
Further, a study by social science researchers reveals that the complexity of this problem hinges on many factors16. In this study, the author argues that evaluation measurements merely record the percentage of times that human judgment agrees with the system. The deeper issue is human concordance, i.e., the level of agreement among human annotators. The author cites studies by commercial companies indicating that in sentiment analysis human concordance is roughly 70% to 79%. Given this fact, a system could achieve a perfect 100% accuracy against its annotated test set while still disagreeing with a random human individual up to 30% of the time.
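This concordance ceiling can be illustrated with a toy simulation. The 85% per-annotator agreement rate below is an assumed parameter, not a figure from the cited study; the point is that a system reproducing annotator A perfectly still disagrees with annotator B exactly as often as the two humans do.

```python
# Toy simulation of the human-concordance ceiling. The 0.85 agreement
# rate is an illustrative assumption, chosen so that the expected
# pairwise human concordance is 0.85^2 + 0.15^2 = 0.745, i.e. in the
# 70-79% range cited above.
import random

random.seed(0)
truth = [random.choice(("pos", "neg")) for _ in range(10000)]

def annotate(labels, agreement):
    """Each annotator flips a label with probability (1 - agreement)."""
    return [l if random.random() < agreement
            else ("neg" if l == "pos" else "pos") for l in labels]

ann_a = annotate(truth, 0.85)
ann_b = annotate(truth, 0.85)
system = list(ann_a)  # the system matches annotator A on every tweet

acc_vs_a = sum(s == a for s, a in zip(system, ann_a)) / len(truth)
acc_vs_b = sum(s == b for s, b in zip(system, ann_b)) / len(truth)
print(acc_vs_a)  # 1.0: "perfect" accuracy against the test annotator
print(acc_vs_b)  # typically around 0.74-0.75 against a second annotator
```

The "100% accurate" system thus disagrees with a second human roughly a quarter of the time, purely because the humans disagree with each other.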
In summary, using precision, recall, F1 value and accuracy to measure sentiment analysis is somewhat a matter of expediency. The ultimate goal of sentiment analysis is to endow computers with the ability to "feel" emotion and act with sentiment like humans. If humans themselves cannot identify the sentiment in natural language 100% correctly, what should we expect from computers?
We discuss these problems here in the hope that these thoughts shed light on interesting aspects of sentiment analysis. Further efforts should be made to scrutinize such issues more closely.
16http://brnrd.me/social-sentiment-sentiment-analysis/ (Date Last Accessed, March 29, 2016)
6. Conclusion
In this work, we explored three mainstream methodologies for sentiment analysis
over Twitter data, namely lexicon-based methods, rule-based methods and machine
learning-based methods.
Our major contributions are threefold. First, we extensively studied popular sentiment lexicons and applied them with both the Simple Word Count and the Feature Scoring approaches; Bing Liu's Lexicon and the Vader Sentiment Lexicon proved effective for Twitter sentiment analysis. Second, we proposed a set of Linguistic Inference Rules that help handle negation, valence shifters and contrast in natural language text; our LIR rules improve the precision and accuracy of Twitter sentiment analysis. Last but not least, we compared two sets of features, Bag-of-Words N-grams and linguistic features, with state-of-the-art machine learning classifiers. The Bag-of-Words features are simple yet effective, and the bi-gram BOW feature achieved the best performance with all three models; linguistic features improved performance only by a slight margin.
Two problems drew our attention. First, whether it is legitimate to evaluate sentiment classification using precision, recall, F1 value and accuracy. Second, whether applying rich linguistic features to sentiment analysis is worth the time and effort, considering that the improvement is not very significant.
In the future, we intend to study many related problems further. On the one hand, we would like to compare Twitter sentiment analysis with other domains; effective unsupervised or lexicon-based classifiers, domain adaptability and feature selection are all relevant topics that need further research. On the other hand, given an efficient sentiment analysis algorithm, we would like to see how it can be applied to real-world problems, for example predicting presidential elections or estimating product reputations and movie ratings. Furthermore, we would also like to dive into the engineering aspects of Twitter sentiment analysis: optimization and scalable algorithms for big data are issues that need to be solved in the not-so-distant future.
LITERATURE CITED
[1] S. Muralidharan, L. Rasmussen, D. Patterson, and J.-H. Shin, “Hope for
haiti: an analysis of facebook and twitter usage during the earthquake relief
efforts,” Public Relations Rev., vol. 37, no. 2, pp. 175–177, Jun. 2011.
[2] D. L. Cogburn and F. K. Espinoza-Vasquez, “From networked nominee to
networked nation: examining the impact of web 2.0 and social media on
political participation and civic engagement in the 2008 obama campaign,” J.
Political Marketing, vol. 10, no. 1-2, pp. 189–213, Feb. 2011.
[3] Merriam-Webster, Merriam-Webster’s Collegiate Dictionary. Springfield,
MA: Merriam-Webster, 2004.
[4] B. Liu, “Sentiment analysis and opinion mining,” Synthesis Lectures on
Human Lang. Tech., vol. 5, no. 1, pp. 1–167, Apr. 2012.
[5] A. Farzindar and D. Inkpen, “Natural language processing for social media,”
Synthesis Lectures on Human Lang. Tech., vol. 8, no. 2, pp. 1–166, Sept. 2015.
[6] C. Strapparava and R. Mihalcea, “Learning to identify emotions in text,” in
Proc. the 2008 ACM Symp. Appl. Comput., Ceará, Brazil, 2008, pp. 1556–1560.
[7] P. D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to
unsupervised classification of reviews,” in Proc. 40th Annu. Meeting on Assoc.
for Computational Linguistics, Philadelphia, PA, 2002, pp. 417–424.
[8] B. Yang and C. Cardie, “Context-aware learning for sentence-level sentiment
analysis with posterior regularization.” in Proc. 52nd Annu. Meeting on
Assoc. for Computational Linguistics, Baltimore, MD, 2014, pp. 325–335.
[9] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proc.
10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining,
Seattle, WA, 2004, pp. 168–177.
[10] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in
phrase-level sentiment analysis,” in Proc. Conf. Human Lang. Tech. and
Empirical Methods in Natural Lang. Process., Vancouver, B.C., Canada, 2005,
pp. 347–354.
[11] T. Wilson, J. Wiebe, and R. Hwa, “Just how mad are you? finding strong and
weak opinion clauses," in Proc. 19th Nat. Conf. Artificial Intell., vol. 4, San
Jose, CA, 2004, pp. 761–769.
[12] Z. Zhang, D. Miao, and B. Yuan, “Context-dependent sentiment classification
using antonym pairs and double expansion,” in Web-Age Inform. Manage.,
Macau, China, 2014, pp. 711–722.
[13] N. Jindal and B. Liu, “Mining comparative sentences and relations,” in Proc.
21st Nat. Conf. Artificial Intell., Boston, MA, 2006, pp. 1331–1336.
[14] A. Esuli and F. Sebastiani, “Sentiwordnet: a publicly available lexical resource
for opinion mining,” in Proc. Lang. Resources and Evaluation Conf., Genoa,
Italy, 2006, pp. 417–422.
[15] P. J. Stone, D. C. Dunphy, and M. S. Smith, “The general inquirer: a
computer approach to content analysis.” in Proc. Spring Joint Comput. Conf.,
New York, NY, 1966, pp. 241–256.
[16] C. Hutto and E. Gilbert, “A parsimonious rule-based model for sentiment
analysis of social media text,” in 8th Int. Conf. Weblogs and Social Media,
Ann Arbor, MI, 2014, pp. 216–225.
[17] S.-M. Kim and E. Hovy, “Identifying and analyzing judgment opinions,” in
Proc. Main Conf. on Human Lang. Tech. Conf. North Amer. Chapter of the
Assoc. of Computational Linguistics, New York, NY, 2006, pp. 200–207.
[18] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic orientation
of adjectives,” in Proc. 35th Assoc. Computational Linguistics and 8th Conf.
European Chapter of the Assoc. Computational Linguistics, Madrid, Spain,
1997, pp. 174–181.
[19] H. Kanayama and T. Nasukawa, “Fully automatic lexicon expansion for
domain-oriented sentiment analysis,” in Proc. 2006 Conf. on Empirical
Methods in Natural Lang. Process., Sydney, Australia, 2006, pp. 355–363.
[20] L. Augustyniak, P. Szymanski, T. Kajdanowicz, and W. Tuliglowicz,
“Comprehensive study on lexicon-based ensemble classification sentiment
analysis,” Entropy, vol. 18, no. 1, p. 4, Dec. 2015.
[21] L. Augustyniak, T. Kajdanowicz, P. Szymanski, W. Tuliglowicz, P. Kazienko,
R. Alhajj, and B. Szymanski, “Simpler is better? lexicon-based ensemble
sentiment classification beats supervised methods,” in Proc. IEEE/ACM Int.
Conf. Advances in Social Network Anal. and Mining, Beijing, China, 2014,
pp. 924–929.
[22] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification
using machine learning techniques,” in Proc. ACL Conf. on Empirical
Methods in Natural Lang. Process., vol. 10, Philadelphia, PA, 2002, pp. 79–86.
[23] S. Tan, X. Cheng, Y. Wang, and H. Xu, “Adapting naive bayes to domain
adaptation for sentiment analysis,” in Adv. in Inform. Retrieval, Toulouse,
France, 2009, pp. 337–349.
[24] M. Gamon, “Sentiment classification on customer feedback data: noisy data,
large feature vectors, and the role of linguistic analysis,” in Proc. 20th Int.
Conf. on Computational Linguistics, Barcelona, Spain, 2004, pp. 841–847.
[25] T. Mullen and N. Collier, “Sentiment analysis using support vector machines
with diverse information sources.” in Proc. Empirical Methods in Natural
Lang. Process., Barcelona, Spain, 2004, pp. 412–418.
[26] S. Li, S. Y. M. Lee, Y. Chen, C.-R. Huang, and G. Zhou, “Sentiment
classification and polarity shifting,” in Proc. 23rd Int. Conf. on
Computational Linguistics, Uppsala, Sweden, 2010, pp. 635–643.
[27] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts,” in Proc. 42nd Annu.
Meeting on Assoc. for Computational Linguistics, Barcelona, Spain, 2004, p.
271.
[28] F. Li, C. Han, M. Huang, X. Zhu, Y.-J. Xia, S. Zhang, and H. Yu,
“Structure-aware review mining and summarization,” in Proc. 23rd Int. Conf.
Computational Linguistics, Uppsala, Sweden, 2010, pp. 653–661.
[29] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts,
“Learning word vectors for sentiment analysis,” in Proc. 49th Annu. Meeting
of Assoc. for Computational Linguistics: Human Lang. Tech., Portland, OR,
2011, pp. 142–150.
[30] F. Sebastiani, “Machine learning in automated text categorization,” ACM
Comput. Surveys, vol. 34, no. 1, pp. 1–47, Mar. 2002.
[31] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-based
methods for sentiment analysis,” Computational Linguistics, vol. 37, no. 2, pp.
267–307, Sept. 2011.
[32] J. Allen, Natural Language Understanding. Upper Saddle River, NJ:
Pearson, 1987.
[33] J. Ramos, “Using tf-idf to determine word relevance in document queries,” in
Proc. 1st Instructional Conf. Mach. Learn., Washington D.C., 2003.
[34] D. Jurafsky, Speech & Language Processing. Upper Saddle River, NJ:
Prentice Hall, 2008.
[35] A. K. McCallum, “Mallet: A machine learning for language toolkit,” 2002.
[Online]. Available: http://mallet.cs.umass.edu (Date Last Accessed: March
29, 2016)
[36] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and
D. McClosky, “The stanford corenlp natural language processing toolkit.” in
Annu. Meeting on Assoc. for Computational Linguistics Syst.
Demonstrations, Baltimore, MD, 2014, pp. 55–60.
[37] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,”
ACM Trans. Intelligent Syst. and Tech., vol. 2, no. 3, p. 27, Apr. 2011.
[38] M. Sun, Z. Liu, M. Zhang, and Y. Liu, Chinese Computational Linguistics
and Natural Lang. Process. Based on Naturally Annotated Big Data. Berlin,
Germany: Springer, 2015.
APPENDIX A
Linguistic Resources
To address linguistic phenomena that lexicon-based classifiers fail to handle, we created a series of rules. In this appendix, we present some of the linguistic resources that we collected. The negation and valence shifter expressions are taken from the Vader [16] source code17; the contrasting conjunctions shown were collected manually.
Table A.1: Valence Shifter Expressions

Negation: aint, arent, cannot, cant, couldnt, darent, didnt, doesnt, ain't, aren't, can't, couldn't, daren't, didn't, doesn't, dont, hadnt, hasnt, havent, isnt, mightnt, mustnt, neither, don't, hadn't, hasn't, haven't, isn't, mightn't, mustn't, neednt, needn't, never, none, nope, nor, not, nothing, nowhere, oughtnt, shant, shouldnt, uhuh, wasnt, werent, oughtn't, shan't, shouldn't, uh-uh, wasn't, weren't, without, wont, wouldnt, won't, wouldn't, rarely, seldom, despite

Intensifying Shifters: absolutely, amazingly, awfully, completely, considerably, decidedly, deeply, effing, enormously, entirely, especially, exceptionally, extremely, fabulously, flipping, flippin, frickin, frigging, friggin, fully, fucking, greatly, hella, highly, hugely, incredibly, intensely, majorly, more, most, particularly, purely, quite, really, remarkably, so, substantially, thoroughly, totally, tremendously, uber, unbelievably, unusually, utterly, very

Weakening Shifters: almost, barely, hardly, just enough, kind of, kinda, kindof, kind-of, less, little, marginally, occasionally, partly, scarcely, slightly, somewhat, sort of, sorta, sortof, sort-of

Contrasting Conjunctions: but, although, though, even though, even if, however
17https://github.com/cjhutto/vaderSentiment (Date Last Accessed, March 29, 2016)
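A minimal sketch of how the expressions in Table A.1 can adjust a lexicon score. The toy lexicon, the three-word look-back window and the ×1.3/×0.7 shifter multipliers below are illustrative assumptions, not the thesis's actual LIR parameters.

```python
# Hedged sketch of lexicon scoring with the Table A.1 expression
# classes. The word scores, window size and multipliers are toy
# values chosen only to show the mechanism.
NEGATIONS = {"not", "never", "cannot", "dont", "don't", "isnt", "isn't"}
INTENSIFIERS = {"very", "really", "absolutely", "extremely"}
WEAKENERS = {"slightly", "barely", "hardly", "somewhat"}
LEXICON = {"good": 1.0, "bad": -1.0, "great": 2.0}  # toy sentiment lexicon

def score(tweet):
    """Score a tweet, shifting each lexicon hit by nearby modifiers."""
    total = 0.0
    tokens = tweet.lower().split()
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        s = LEXICON[tok]
        window = tokens[max(0, i - 3):i]  # look back up to 3 words
        if any(w in INTENSIFIERS for w in window):
            s *= 1.3  # intensifying shifter strengthens the score
        if any(w in WEAKENERS for w in window):
            s *= 0.7  # weakening shifter dampens the score
        if any(w in NEGATIONS for w in window):
            s = -s    # negation flips the polarity
        total += s
    return total

print(score("this is very good"))  # 1.3
print(score("this is not good"))   # -1.0
```

Contrasting conjunctions would be handled one level up, by splitting the sentence at the conjunction and weighting the clause after it more heavily.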