A Replication Study of the Top Performing Systems in SemEval Twitter Sentiment Analysis Efstratios Sygkounas, Giuseppe Rizzo, Raphaël Troncy <[email protected] > @rtroncy

Page 1: A replication study of the top performing systems in SemEval twitter sentiment analysis

A Replication Study of the Top Performing Systems in SemEval

Twitter Sentiment Analysis

Efstratios Sygkounas, Giuseppe Rizzo, Raphaël Troncy

<[email protected]>

@rtroncy

Page 2: A replication study of the top performing systems in SemEval twitter sentiment analysis

Replication Study [1]

Replicability: repeating a previous result under the original conditions (e.g. same system configuration and datasets)

Reproducibility: reproducing a previous result under different, but comparable conditions

Generalizability: applying an existing, empirically validated technique to a different task/domain than the original one

21/10/2016 - 15th International Semantic Web Conference (ISWC), Kobe, Japan - 2

[1] Hasibi, F., Balog, K., Bratsberg, S.E. On the Reproducibility of the TAGME Entity Linking System. 38th European Conference on Information Retrieval (ECIR), 2016

Page 3: A replication study of the top performing systems in SemEval twitter sentiment analysis

SemEval 2013-2015 [2]

Task: Sentiment analysis in Twitter

2013: Task 2
2014: Task 9
2015: Task 10

Subtask A: Contextual Polarity Disambiguation
Subtask B: Message Polarity Classification
Subtask C: Topic-Based Message Polarity Classification
Subtask D: Detecting Trends Towards a Topic
Subtask E: Determining strength of association of Twitter terms with positive sentiment

[2] Rosenthal, S., Nakov, P., Kiritchenko, S., Mohammad, S., Ritter, A., Stoyanov, V. SemEval-2015 Task 10: Sentiment Analysis in Twitter. 9th International Workshop on Semantic Evaluation (SemEval), 2015

http://alt.qcri.org/semeval2015

Page 4: A replication study of the top performing systems in SemEval twitter sentiment analysis

SemEval Subtask B (started in 2013)

Annotations performed by Amazon Mechanical Turk workers

Tweets are classified into 3 classes: positive, neutral, negative

Page 5: A replication study of the top performing systems in SemEval twitter sentiment analysis

SemEval Subtask B

Tweet ID | Gold Standard ID | Gold Standard | Tweet
522855301876580353 | T15111159 | positive | I've been watching Gilmore Girls for the past 3 hours. Oops, happy Thursday!
523087448264671233 | T15111142 | neutral | My Friday consists of Netflix and hot tea allllllllll day long.
522960120683429889 | T15111318 | negative | Kobe Bryant smiling as he re-enters the game with the Lakers losing 91-63 in the 4th quarter. Probably insanity settling in.
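A record with these four fields can be read in a few lines of code. The sketch below is only illustrative: it assumes one tab-separated record per line with the fields in the order shown above, and the `GoldTweet` type and `parse_gold_line` function are hypothetical names, not part of the SemEval distribution.

```python
from typing import NamedTuple

# The three polarity classes used in Subtask B.
LABELS = {"positive", "neutral", "negative"}


class GoldTweet(NamedTuple):
    tweet_id: str
    gold_id: str
    label: str
    text: str


def parse_gold_line(line: str) -> GoldTweet:
    """Parse one gold standard record (assumed tab-separated layout)."""
    # Split on the first three tabs only, so tabs inside the tweet text survive.
    tweet_id, gold_id, label, text = line.rstrip("\n").split("\t", 3)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return GoldTweet(tweet_id, gold_id, label, text)


example = (
    "522855301876580353\tT15111159\tpositive\t"
    "I've been watching Gilmore Girls for the past 3 hours. Oops, happy Thursday!"
)
record = parse_gold_line(example)
```

Keeping the tweet ID as a string avoids any loss of precision when the records are later exchanged with tools that treat large integers as floats.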

Page 6: A replication study of the top performing systems in SemEval twitter sentiment analysis

SemEval Subtask B

Systems scored according to the F1 measure

2015: ~40 systems competing

Webis Team, SemEval 2015

Hagen, M., Potthast, M., Buchner, M., Stein, B.: Webis: An Ensemble for Twitter Sentiment Detection. International Workshop on Semantic Evaluation (SemEval), 2015
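To make the scoring concrete, here is a minimal sketch of the F1 measure. It assumes the variant described in the SemEval task overview, where the score is the average of the F1 of the positive class and the F1 of the negative class (neutral tweets only affect the score through errors). The function names are illustrative and this is not the official scorer.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], target: str) -> float:
    """F1 of a single class, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    fp = sum(1 for g, p in zip(gold, pred) if g != target and p == target)
    fn = sum(1 for g, p in zip(gold, pred) if g == target and p != target)
    if tp == 0:
        return 0.0  # also avoids division by zero when the class is never hit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def semeval_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the positive-class F1 and the negative-class F1 (assumed metric)."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "neutral"]
score = semeval_f1(gold, pred)
```

In this toy example the positive class has precision and recall of 0.5 each, the negative class is perfect, and the averaged score is 0.75.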

Page 7: A replication study of the top performing systems in SemEval twitter sentiment analysis

Ensemble Learning that combines different classifiers with different settings

Page 8: A replication study of the top performing systems in SemEval twitter sentiment analysis

Webis System

Webis’s system is an ensemble of 4 classifiers:

NRC-Canada (SemEval 2013)
GU-MLT-LT (SemEval 2013)
KLUE (SemEval 2013)
TeamX (SemEval 2014)
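One common way to combine several classifiers into an ensemble is to average their per-class confidence scores and pick the class with the highest mean. The sketch below illustrates that idea; whether it matches the exact combination rule of the Webis code is an assumption, and the scores shown are made-up numbers, not outputs of the real systems.

```python
from typing import Dict, List, Tuple

LABELS = ("positive", "neutral", "negative")


def ensemble_predict(
    scores_per_classifier: List[Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Average per-class confidences over all classifiers and return the top label."""
    if not scores_per_classifier:
        raise ValueError("at least one classifier is required")
    n = len(scores_per_classifier)
    averaged = {
        label: sum(scores[label] for scores in scores_per_classifier) / n
        for label in LABELS
    }
    best = max(LABELS, key=lambda label: averaged[label])
    return best, averaged


# Hypothetical confidences for one tweet, one dict per sub-classifier.
scores = [
    {"positive": 0.6, "neutral": 0.3, "negative": 0.1},  # e.g. NRC-Canada
    {"positive": 0.2, "neutral": 0.5, "negative": 0.3},  # e.g. GU-MLT-LT
    {"positive": 0.5, "neutral": 0.4, "negative": 0.1},  # e.g. KLUE
    {"positive": 0.4, "neutral": 0.3, "negative": 0.3},  # e.g. TeamX
]
label, averaged = ensemble_predict(scores)
```

Averaging confidences rather than taking a majority vote lets a single very confident classifier outweigh several hesitant ones, which is one reason this rule is popular for small ensembles.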

Page 9: A replication study of the top performing systems in SemEval twitter sentiment analysis

Webis System

System | Classifier | Features
NRC-Canada | Support Vector Machine (SVM) | n-grams, allcaps, POS, polarity dictionaries, punctuation marks, emoticons, word lengthening, clusters and negation
GU-MLT-LT | Linear regression | normalized unigrams, stems, clustering and negation
KLUE | Maximum Entropy | unigrams, bigrams, and an extended unigram model that includes a simple treatment of negation
TeamX | Logistic Regression (LIBLINEAR) | word n-grams, character n-grams, clusters and word senses

Page 10: A replication study of the top performing systems in SemEval twitter sentiment analysis

Webis System

System | Language resources
NRC-Canada | NRC Emotion, MPQA, Bing Liu’s Opinion Lexicon, NRC Hashtag Sentiment and the Sentiment140
GU-MLT-LT | Polarity Dictionary and SentiWordNet
KLUE | SentiStrength, extended version of AFINN-111, large-vocabulary distributional semantic models (DSM) from English Wikipedia and Google Web 1T 5-Grams databases
TeamX | Formal: MPQA Subjectivity Lexicon, General Inquirer and SentiWordNet. Informal: AFINN-111, Bing Liu’s Opinion Lexicon, NRC Hashtag Sentiment Lexicon and Sentiment140 Lexicon

Page 11: A replication study of the top performing systems in SemEval twitter sentiment analysis

Replicability

Download Webis’s already trained models and code https://github.com/webis-de/ECIR-2015-and-SEMEVAL-2015

Download SemEval’s datasets via the Twitter API (some tweets not available anymore)

Page 12: A replication study of the top performing systems in SemEval twitter sentiment analysis

Replicability

Versioning is an important aspect to be considered in any replication study

We replaced the old Stanford CoreNLP libraries with the newest ones

Dataset | Claimed in paper | Webis’s models
Replicate Webis system on test 2013 | 68.49 | 69.62
Replicate Webis system on test 2014 | 70.86 | 66.65
Replicate Webis system on test 2015 | 64.84 | 66.17

Page 13: A replication study of the top performing systems in SemEval twitter sentiment analysis

Reproducibility

Dataset | Claimed in paper | Webis’s models | Our models
Replicate Webis system on test 2013 | 68.49 | 69.62 | 70.06
Replicate Webis system on test 2014 | 70.86 | 66.65 | 69.31
Replicate Webis system on test 2015 | 64.84 | 66.17 | 66.57
Replicate Webis system without TeamX on test 2013 | N/A | 69.04 | 70.34
Replicate Webis system without TeamX on test 2014 | N/A | 66.51 | 68.56
Replicate Webis system without TeamX on test 2015 | N/A | 65.58 | 66.19

Our models have better performance in general

SentiME without TeamX performs worse on the 2014 and 2015 datasets, but not on the 2013 dataset

Page 14: A replication study of the top performing systems in SemEval twitter sentiment analysis

Generalization

SentiME

Consists of 4 classifiers + the Stanford Sentiment System

We train our models using bagging in order to boost the training of the ensemble

We noticed a lot of commonalities between the features of TeamX and those of the Stanford Sentiment System, so we decided to run tests with and without TeamX in order to assess that classifier's contribution

Page 15: A replication study of the top performing systems in SemEval twitter sentiment analysis

Stanford Sentiment System

● The Stanford Sentiment System is a recursive neural tensor network trained on the Stanford Sentiment Treebank

● It can capture the meaning of compositional phrases, which is hard to achieve with standard bag-of-words approaches

● Classifies a sentence into 5 classes (very positive, positive, neutral, negative and very negative)

● We use the pre-trained models that the Stanford team provides
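Since Subtask B uses 3 classes while the Stanford Sentiment System outputs 5, the two label sets have to be reconciled before the classifier can join the ensemble. The slides do not say how this is done; the sketch below assumes the simplest option, collapsing the two "very" classes into their plain counterparts. The mapping table and function name are hypothetical.

```python
# Assumed collapse of the 5 Stanford classes into the 3 SemEval classes.
FIVE_TO_THREE = {
    "very positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "very negative": "negative",
}


def to_three_classes(stanford_label: str) -> str:
    """Map a 5-class label to a 3-class label, ignoring case and outer whitespace."""
    key = stanford_label.strip().lower()
    if key not in FIVE_TO_THREE:
        raise ValueError(f"unknown Stanford label: {stanford_label!r}")
    return FIVE_TO_THREE[key]


mapped = [to_three_classes(label) for label in ("Very positive", "Neutral", "Very negative")]
```

Other choices are possible, for example keeping the 5-class probabilities and summing them per polarity, which would preserve the classifier's confidence for averaging.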

Page 16: A replication study of the top performing systems in SemEval twitter sentiment analysis

Bagging

Because bagging introduces some randomness into the training process, and the size of the bootstrap samples is not fixed, we performed multiple experiments with sample sizes ranging from 33% to 175% of the initial dataset size

We observed that bagging with 150% of the initial dataset size leads to the best performance in terms of F1 score
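The bootstrap step behind these experiments can be sketched as sampling with replacement, with the sample size expressed as a ratio of the original training set. This is an illustration of the general technique only; the function name, the `ratio` parameter and the seeding are assumptions, not the SentiME code.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_sample(dataset: Sequence[T], ratio: float,
                     seed: Optional[int] = None) -> List[T]:
    """Draw round(len(dataset) * ratio) items with replacement."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    rng = random.Random(seed)  # local generator, so runs are repeatable per seed
    size = round(len(dataset) * ratio)
    return rng.choices(dataset, k=size)


training_set = list(range(10))  # stand-in for the training tweets
sample_150 = bootstrap_sample(training_set, 1.5, seed=0)   # 150% of the data
sample_33 = bootstrap_sample(training_set, 0.33, seed=0)   # 33% of the data
```

Because sampling is done with replacement, a ratio above 100% does not add new information; it changes how often each tweet is repeated, and fixing the seed is what makes such an experiment repeatable.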

Page 17: A replication study of the top performing systems in SemEval twitter sentiment analysis

SentiME System

Page 18: A replication study of the top performing systems in SemEval twitter sentiment analysis

Generalization

We performed four different experiments to evaluate the performance of SentiME compared to our previous replicate of the Webis system:

1. Webis replicate system: the replicate of the Webis system using re-trained models

2. SentiME system: the system we propose

3. Webis replicate system without TeamX

4. SentiME system without TeamX

Page 19: A replication study of the top performing systems in SemEval twitter sentiment analysis

Generalization

System | SemEval2014-test | SemEval2014-sarcasm | SemEval2015-test | SemEval2015-sarcasm
Webis replicate system | 69.31 | 60.00 | 66.57 | 54.19
SentiME system | 68.27 | 62.57 | 67.39 | 60.92
Webis replicate system without TeamX | 68.56 | 62.04 | 66.19 | 56.86
SentiME system without TeamX | 69.27 | 62.04 | 66.38 | 58.92
Webis | 70.86 | 49.33 | 64.84 | 53.59

• SentiME outperforms the Webis replicate system on all datasets except SemEval2014-test

• SentiME improves the F1 score by 2.5% and 6.5% on the SemEval2014-sarcasm and SemEval2015-sarcasm datasets, respectively

• On the SemEval2014-sarcasm dataset there is a significant difference in performance between the original Webis system (49.33%) and our replicate (60.00%).

Page 20: A replication study of the top performing systems in SemEval twitter sentiment analysis

Summary

The Stanford Sentiment System is heavily skewed towards negative classification

We managed to improve the Webis system by 1% in the general case by introducing a fifth sub-classifier (the Stanford Sentiment System) and by boosting the training with 150% bagging

The SentiME system also outperforms the Webis system by 6.5% on the more difficult sarcasm dataset (thanks to the Stanford classifier)

Page 21: A replication study of the top performing systems in SemEval twitter sentiment analysis

Some Lessons Learned

Availability of source code AND models significantly helps when performing a reproducibility study

The pre-trained models provided by Webis are not exactly the same as the re-trained models we created from the data at our disposal

You have to archive data … and software libraries!

It is possible that the Webis authors did not detail the full set of features they used

https://github.com/MultimediaSemantics/sentime