Sentiment Analysis of Twitter Data

Upload: nurendra-choudhary

Post on 22-Jan-2017


TRANSCRIPT

Page 1: Sentiment analysis of Twitter Data

Sentiment Analysis of Twitter Data

Page 2: Sentiment analysis of Twitter Data

Hello! We are Team 10

Member 1: Name: Nurendra Choudhary, Roll Number: 201325186

Member 2: Name: P Yaswanth Satya Vital Varma, Roll Number: 201301064

Page 3: Sentiment analysis of Twitter Data

Introduction:

Twitter is a popular microblogging service where users create status messages (called "tweets").

These tweets sometimes express opinions about different topics.

Generally, this type of sentiment analysis is useful for consumers who are trying to research a product or service, or marketers researching public opinion of their company.

Page 4: Sentiment analysis of Twitter Data

AIM OF THE PROJECT

The purpose of this project is to build an algorithm that can accurately classify Twitter messages as positive or negative, with respect to a query term.

Our hypothesis is that we can obtain high accuracy on classifying sentiment in Twitter messages using machine learning techniques.

Page 5: Sentiment analysis of Twitter Data

1. Dataset

The details of the dataset used for training the classifier.

Page 6: Sentiment analysis of Twitter Data

Sentiment140 Dataset

1,600,000 sentences annotated as positive or negative.

http://help.sentiment140.com/for-students/

Page 7: Sentiment analysis of Twitter Data

2. Pre-Processing

Pre-processing the raw data to make it easier for the classifier to use.

Page 8: Sentiment analysis of Twitter Data

Steps in Preprocessing:

➜ Case folding of the data (turning everything to lowercase).

➜ Punctuation removal from the data.

➜ Common abbreviations and acronyms expanded.

➜ Hashtag removal.
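
The four preprocessing steps can be sketched in a few lines of Python. This is only a minimal illustration, not the team's actual code: the `ABBREVIATIONS` table is a hypothetical two-entry stand-in, and "hashtag removal" is assumed here to mean dropping the whole `#tag` token (it could equally mean dropping only the `#` sign).

```python
import re
import string

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {"afaik": "as far as i know", "gn": "good night"}


def preprocess(tweet):
    """Apply case folding, hashtag removal, punctuation removal, abbreviation expansion."""
    text = tweet.lower()                      # case folding
    text = re.sub(r"#\w+", " ", text)         # assumption: drop the whole hashtag token
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]  # expand abbreviations
    return " ".join(words)


print(preprocess("AFAIK, this phone is GREAT! #happy"))
```

Hashtags are removed before punctuation in this sketch because `#` is itself a punctuation character; stripping punctuation first would turn the hashtag into an ordinary word.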

Page 9: Sentiment analysis of Twitter Data

2. The Main System

Includes Language Models (Unigram, Bigram), Lexical Scoring, Machine Learning Scores, Emoticon Scores.

Page 10: Sentiment analysis of Twitter Data

2.1 Training Distributed Semantic Representation (Word2Vec Model)

➜ We use a Python Module called gensim.models.word2vec for this.

➜ We train a model using only the sentences (after preprocessing) from the corpus.

➜ This generates vectors for all the words in the corpus.

➜ This model can now be used to get vectors for the words.

➜ For unknown words, we use the vectors of words with frequency one.
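
One way to read the last point is to build a single "unknown word" vector from the vectors of the words that occur exactly once in the corpus. The sketch below assumes that reading and averages those vectors; the averaging, the function names, and the plain-dict `word_vectors` are all assumptions made for illustration, not details given in the slides. With gensim, frequency-one words only receive vectors if the model is trained with `min_count=1`, since the default drops rare words.

```python
from collections import Counter


def unknown_word_vector(tokenized_sentences, word_vectors):
    """Average the vectors of all words that occur exactly once (assumed strategy)."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    rare = [word_vectors[w] for w, c in counts.items() if c == 1 and w in word_vectors]
    if not rare:
        raise ValueError("no frequency-one words with vectors")
    dim = len(rare[0])
    return [sum(vec[i] for vec in rare) / len(rare) for i in range(dim)]


def lookup(word, word_vectors, unk_vector):
    """Return the word's vector, falling back to the unknown-word vector."""
    return word_vectors.get(word, unk_vector)


# Toy data standing in for a trained Word2Vec model.
sentences = [["good", "movie"], ["good", "day"]]
vectors = {"good": [1.0, 1.0], "movie": [2.0, 0.0], "day": [0.0, 4.0]}
unk = unknown_word_vector(sentences, vectors)
print(lookup("unseen", vectors, unk))
```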

Page 11: Sentiment analysis of Twitter Data

2.2 Language Model

Unigram

The word vectors are taken individually to train.

E.g.: "I am not dexter." is taken as: [I, am, not, dexter]

Bigram

The word vectors are taken two at a time to train.

E.g.: "I am not dexter." is taken as: [(I, am), (am, not), (not, dexter)]

Unigram + Bigram

Use unigrams normally, but use a bigram when sentiment-reversing words such as "not" or "no" are present.

E.g.: "I am not dexter." is taken as: [I, am, (not, dexter)]
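
The three token groupings can be sketched as follows. The functions work on plain tokens rather than word vectors to keep the example short, and the `NEGATIONS` set is an assumed list, since the slides only name "not, no, etc."

```python
# Assumed list of sentiment-reversing words; the slides only name "not, no, etc."
NEGATIONS = {"not", "no", "never"}


def unigrams(tokens):
    """Each token on its own."""
    return list(tokens)


def bigrams(tokens):
    """Every pair of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def unigram_plus_bigram(tokens):
    """Unigrams by default; a negation word is paired with the word that follows it."""
    features = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATIONS and i + 1 < len(tokens):
            features.append((tokens[i], tokens[i + 1]))
            i += 2  # the following word is consumed by the bigram
        else:
            features.append(tokens[i])
            i += 1
    return features


tokens = "I am not dexter".split()
print(unigrams(tokens))
print(bigrams(tokens))
print(unigram_plus_bigram(tokens))
```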

Page 12: Sentiment analysis of Twitter Data

2.3 Training For Machine Learning Scores

1. Use the various language models to train various two-class classifiers for results.

2. The classifiers we used are:

a. Support Vector Machines - Scikit Learn (Python)

b. Multi Layer Perceptron Neural Network - Scikit Learn (Python)

c. Naive Bayes Classifier - Scikit Learn (Python)

d. Decision Tree Classifier - Scikit Learn (Python)

e. Random Forest Classifier - Scikit Learn (Python)

f. Logistic Regression Classifier - Scikit Learn (Python)

g. Recurrent Neural Networks - PyBrain module (Python)

Page 13: Sentiment analysis of Twitter Data

2.3.1 Reasons for using the Classifiers

Logistic Regression: Logistic regression is a powerful statistical way of modeling a binomial outcome (one that takes the value 0 or 1, like having or not having a disease) with one or more explanatory variables.

Naive Bayes Classifier: Used to try solving the problem with a simple classifier.

Multi-Layer Perceptron Neural Network Classifier: This method has significantly improved results in binary classification compared to classical classifiers.

Recurrent Neural Networks: This class of neural networks has significantly improved results for various Natural Language Processing problems. Hence, it was tried too.

Page 14: Sentiment analysis of Twitter Data

2.3.1 Reasons for using the Classifiers

Decision Trees: Decision trees are very intuitive and easy to explain. They do not require any assumptions of linearity in the data.

Random Forest: Decision trees tend to overfit. Hence an ensemble of them gives a much better output for unseen data.

Support Vector Machines: This classifier has been reported in many research papers to give the best results among the classical classifiers.

Page 15: Sentiment analysis of Twitter Data

2.3.2 Accuracies of Various Approaches (accuracies are calculated using 5-fold cross-validation)

Classifier                 Unigram   Bigram   Unigram + Bigram
Support Vector Machines    71.1%     -NA-*    74.3%
Naive Bayes Classifier     64.2%     62.8%    65.0%
Logistic Regression        67.4%     72.1%    71.6%

* Takes too much time to train; stopped after 28 hours.

Page 16: Sentiment analysis of Twitter Data

2.3.2 Accuracies of Various Approaches (accuracies are calculated using 5-fold cross-validation)

Classifier                                         Unigram   Bigram   Unigram + Bigram
Decision Trees                                     60.4%     60.0%    61.5%
Random Forest Classifier                           67.1%     70.8%    71.3%
Multi-Layer Perceptron Neural Network Classifier   68.6%     72.7%    74%

Page 17: Sentiment analysis of Twitter Data

2.3.2 Accuracies of Various Approaches (accuracies are calculated using 5-fold cross-validation)

Classifier                  Unigram   Bigram   Unigram + Bigram
Recurrent Neural Networks   69.1%     70.4%    71.5%
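
For readers unfamiliar with 5-fold cross-validation, the evaluation loop can be sketched in plain Python. This is a generic illustration and not the team's evaluation code: `train_fn` and `predict_fn` are hypothetical callables, the toy "classifier" simply predicts the majority class, and no shuffling is done (in practice the data would normally be shuffled, and a library routine would be used).

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(features, labels, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        model = train_fn([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Toy stand-in classifier: always predict the most common training label.
def train_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature):
    return model


toy_features = list(range(10))
toy_labels = [1] * 8 + [0] * 2
mean_accuracy = cross_val_accuracy(toy_features, toy_labels, train_majority, predict_majority)
print(mean_accuracy)
```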

Page 18: Sentiment analysis of Twitter Data

2.3.4 Based on the Above Results:

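We chose Unigram+Bigram with the Random Forest Classifier as part of our system. Unigram+Bigram gave the best accuracy for most classifiers; note that in the tables above, Support Vector Machines (74.3%) and the Multi-Layer Perceptron (74%) scored higher than Random Forest (71.3%) with this language model.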

Page 19: Sentiment analysis of Twitter Data

Emoticon Scoring

Emoticons play a major role in deciding the sentiment of a sentence, hence they are scored separately.

Page 20: Sentiment analysis of Twitter Data

Emoticon Scoring

1. Search for emoticons in the given text using RegEx or find.

2. Use a dictionary to score the emoticons.

3. Use this emoticon score in the model.
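
A minimal sketch of this step using Python's `re` module is shown below. The `EMOTICON_SCORES` dictionary and its values are hypothetical; the slides do not list the emoticons or scores actually used, and summing the scores is an assumption.

```python
import re

# Hypothetical emoticon dictionary; the real one would be far larger.
EMOTICON_SCORES = {":)": 1.0, ":-)": 1.0, ":D": 1.0, ":(": -1.0, ":-(": -1.0}

# Longest emoticons first so ":-)" is not partially matched by a shorter pattern.
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_SCORES, key=len, reverse=True))
)


def emoticon_score(text):
    """Sum the dictionary scores of every emoticon found in the text (assumed rule)."""
    return sum(EMOTICON_SCORES[match] for match in _EMOTICON_RE.findall(text))


print(emoticon_score("great day :) :D"))
print(emoticon_score("so sad :-("))
```

In a pipeline like the one described, this would have to run on the raw tweet, because case folding and punctuation removal destroy emoticons such as `:D`.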

Page 21: Sentiment analysis of Twitter Data

Lexical Scoring (scoring based on the words of the text)

1. Get the text.

2. Lemmatize the text.

3. Score the lemmatized text using dictionaries.

4. The score will be used in the final system. This will be given more weightage as this is more definite.
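
The lexical scoring steps can be sketched as below. To stay self-contained, lemmatization is faked with a tiny hypothetical lookup table (`LEMMAS`); a real system would use a proper lemmatizer. The sentiment `LEXICON`, its values, and the use of a plain sum are likewise assumptions for illustration.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"loved": "love", "loves": "love", "hated": "hate", "movies": "movie"}

# Hypothetical sentiment dictionary: positive words > 0, negative words < 0.
LEXICON = {"love": 1.0, "good": 1.0, "hate": -1.0, "bad": -1.0}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]


def lexical_score(text):
    """Sum the lexicon scores of the lemmatized tokens; unlisted words score 0."""
    tokens = text.lower().split()
    return sum(LEXICON.get(lemma, 0.0) for lemma in lemmatize(tokens))


print(lexical_score("I loved the good movies"))
print(lexical_score("she hated it"))
```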

Page 22: Sentiment analysis of Twitter Data

Training Classifier and Word2Vec Model

[Flow diagram. Boxes: Annotated Training Data; Preprocessing; Train Word2Vec Model; Sentences; Sentence Vector; Train using various classifier algorithms; Classifier Model]

Page 23: Sentiment analysis of Twitter Data

The Overall Scoring Process Goes Like This

[Flow diagram. Boxes: Sentences; Classifier Scores (Unigram, Bigram); Lexical Scores; Emoticon Scores; Weight of Lexical Scores; Weight of Emoticons; Overall Scores]
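
The slides do not give the formula that combines the three scores. As a hedged illustration only, the sketch below assumes a simple weighted sum in which the lexical score gets the larger weight (the slides say it is given more weightage); the weight values, the function names, and the zero threshold are all hypothetical.

```python
def overall_score(classifier_score, lexical_score, emoticon_score,
                  lexical_weight=2.0, emoticon_weight=1.0):
    """Assumed combination rule: classifier score plus weighted lexical and emoticon scores."""
    return (classifier_score
            + lexical_weight * lexical_score
            + emoticon_weight * emoticon_score)


def to_label(score):
    """Map the combined score to a class; the zero threshold is an assumption."""
    return "positive" if score >= 0 else "negative"


combined = overall_score(classifier_score=0.25, lexical_score=-0.5, emoticon_score=1.0)
print(combined, to_label(combined))
```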

Page 24: Sentiment analysis of Twitter Data

Challenges in the Approach

Randomness in Data: Tweets are written by users, hence they are not very formal.

Emoticons: There are many types of emoticons, with new ones appearing very frequently.

Abbreviations: Users use a lot of abbreviations and slang such as AFAIK, GN, etc. Capturing all of them is difficult.

Grapheme Stretching: Emotions are expressed by stretching normal words, e.g. Please -> Pleaaaaaseeeeee

Reversing Words: Some words completely reverse the sentiment of another word. E.g.: not good == opposite(good)

Technical Challenges: Classifiers take a lot of time to train, hence silly mistakes cost a lot of time.

Page 25: Sentiment analysis of Twitter Data

Future Improvements

➜ Handle Grapheme Stretching

➜ Handle authenticity of Data and Users

➜ Handle Sarcasm and Humor
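
As one possible starting point for the grapheme-stretching item (not something the project implemented), runs of three or more identical characters can be collapsed with a regular expression. The function name and the choice to collapse to a single character are assumptions; the rule is crude and damages words that legitimately contain a stretched double letter (for example "coool" becomes "col"), so a dictionary check would be needed in practice.

```python
import re

# Matches any character repeated three or more times in a row.
_STRETCH_RE = re.compile(r"(.)\1{2,}")


def collapse_stretching(word):
    """Collapse runs of 3+ identical characters to one character (assumed heuristic)."""
    return _STRETCH_RE.sub(r"\1", word)


print(collapse_stretching("Pleaaaaaseeeeee"))
print(collapse_stretching("good"))
```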