Sentiment Analysis May 9, 2016
IS 688 : Web Mining – Spring 2016
Sentiment Analysis: Amazon Movie Reviews Dataset
Project Report – Group 1
Professor: Christopher Markson
Team: Amit | Maham | Mashael | Karan | Nidhish
5-9-2016
Table of Contents
Acknowledgement
Abstract
Problem Statement
Introduction
Data Collection
    Dataset Source and Format
    Problem
    Solution
        Parsing the Data in R-Friendly Format
        Getting Additional Supporting Data
Model Selection
    Getting a basic sentiment score for each review
    Creating a word cloud for every movie
    Determining the Point-wise Mutual Information (PMI) sentiment score for each movie
    Aggregating all the sentiment scores
    Assigning an overall sentiment score to each movie
Result Overview
    Value Obtained
    Achievements
Scope for Improvement
    Identification of accurate review analysis through Plot Trajectory
    Word Clouds based on a certain Part-of-Speech
    Other
Citation
Acknowledgement
We would like to express our gratitude and appreciation to Professor Christopher Markson, who not only gave us the responsibility of completing this report, but also helped us in completing it. A special thanks to him for pointing us to R-Bloggers, which helped us complete this project report successfully.
We would also like to acknowledge the R packages, R-Bloggers posts and all the other online help we received in reaching our objective: sentiment analysis and the PMI function.
Abstract
Real-time sentiment analysis is a challenging machine learning task, due to the scarcity of labeled data and the sudden changes in sentiment caused by real-world events, which must be interpreted instantly. In this project we propose solutions that save users the time they would otherwise spend reading every review of a product, and help them make a better-informed decision instantly. We also strove to acquire labels and cope with concept drift in this setting by using findings from social psychology on how humans prefer to disclose certain types of emotions. In particular, we use the findings that humans are more motivated to report positive feelings than negative feelings, and prefer to report extreme feelings rather than average feelings.
The report mainly covers gathering and parsing the data, collecting additional information about each movie, and the sentiment analysis performed on the Amazon movie reviews. The full dataset contains around 8 million reviews, spanning a period of more than 10 years, from August 1997 to October 2012.
We show that our sentiment analysis produces useful accuracy when analyzing reactions in the Amazon movie reviews, while requiring little human effort in generating supervisory labels.
Problem Statement
Users have written hundreds of reviews for each movie. The reviews are expressed in natural language, along with a self-annotated score describing the overall sentiment of the review. To make a well-informed decision, a user would have to go through each of them, a time-consuming activity that users are highly unlikely to undertake.
In this project we help users make a better-informed decision, not only from the aggregate of the self-annotated data but also by calculating the semantic orientation and polarity of each review individually.
However, reviews alone could be misleading; therefore, we also calculated a Point-wise Mutual Information (PMI) score for each movie separately. The final score of a movie was then calculated as the aggregate of the user-annotated data, the sentiment scores of its reviews, and its PMI score.
Introduction
As social media platforms become the primary medium people use to express their opinions and feelings about the multitude of topics that pop up daily in the news, the vast amount of opinionated data now available in the form of social streams gives us an unprecedented opportunity to build valuable applications that monitor public opinions and opinion shifts. For example, a platform can track human sentiment about a movie, something far more appealing than the relative number of mentions, which is what most movie websites currently offer. Creating such applications enriches the personal experience of watching a movie, where not only the movie itself, but how others react to it, is part of the experience.
The task of interpreting positive and negative feelings expressed on social streams has a number of unique characteristics: the human cost of generating a constant flow of labeled messages on streams remains high, and the distribution of positive and negative opinions is potentially quite different from the random samples obtained in traditional opinion polls and survey methodologies.
We built sentiment analysis models that exploit two factors widely described by substantive
research from social psychology and behavioral economics that describe human preferences
when disclosing emotion publicly:
Positive-negative sentiment report imbalance: people tend to express positive feelings more than negative feelings in social environments.
Extreme-average sentiment report imbalance: people tend to express extreme feelings more than average feelings in social environments.
We explore each of these two self-report imbalances to accomplish a different subtask in learning-based sentiment analysis.
The first self-report factor, which we call positive-negative sentiment report imbalance throughout the paper, is employed to acquire labeled data that supports supervised classifiers. In the context of polarizing groups (a division of the population into groups of people sharing similar opinions), a positive event for one group tends to be negative for the other, and vice versa. We predict the current dominant sentiment by simply counting how many members of each group, relative to group size, decided to post a message during a specified time frame. We adopt a probabilistic model that computes the uncertainty of the social context and, at each time frame, generates a probabilistic sentiment label, which can then be incorporated into a range of content-based supervised classifiers.
The second self-report factor we explore is related to the human tendency to report extreme experiences more than average experiences. The extreme-average sentiment report imbalance has an important consequence for real-time sentiment tracking: because extreme feelings stimulate reactions, spikes of activity in streams of opinionated text tend to contain highly emotional terms, which are precisely the features that are helpful for sentiment prediction.
Our experimental studies demonstrate that these dynamic features are better indicators of emerging and strong feelings than traditional static representations (e.g., TF-IDF), allowing the underlying classification model to adapt more quickly to sudden sentiment drift induced by real-world events. As a result, our framework can be incorporated into sophisticated sentiment classifiers that make use of more powerful features.
Data Collection
Dataset Source and Format:
Unlike the majority of research on supervised sentiment analysis, which focuses on batch processing of opinionated documents, here we are interested in the setting where the data arrives as a stream of Amazon movie reviews.
The dataset was downloaded from http://snap.stanford.edu/data/web-Movies.html; the downloaded text file was a 3 GB zipped file, approximately 9 GB when unzipped, containing more than 8 million reviews. The dataset was published by the Stanford University researchers J. McAuley and J. Leskovec. During web scraping for the PMI calculation, Google blocked our requests, so we scaled the data down to 400 reviews from 14 movies to perform the sentiment analysis.
The data in the original file was provided in the following format; the details of each field are listed below:
Product Id: a unique ID generated by Amazon and assigned to each movie.
User Id: the ID of the user.
Profile Name: the name of the user who wrote the review.
Helpfulness: the number of users who found the review useful, out of the total who rated it.
Score: the rating the user gave the product.
Time: the time of the review.
Summary: a short summary of the review.
Text: the full comments and review written by the user about the movie.
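For reference, each review in the downloaded file is a multi-line block of key-value pairs separated by blank lines, roughly as below (the field names follow the SNAP web-Movies layout; the values are illustrative):

```
product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: John Doe
review/helpfulness: 7/9
review/score: 5.0
review/time: 1042502400
review/summary: Great classic
review/text: One of the best movies I have seen ...
```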
Problem
Though the data provided was quite rich, we encountered a few issues while dealing with it in R. The following is a summary of the problems.
1. The downloaded data format was not R-friendly:
The raw file downloaded from the website stored each record across multiple rows, and we had to transpose it into column format so that R could read it.
The data also contained various breaks, invalid spaces and symbols, which needed to be removed before any meaningful sentiment analysis could be done.
2. Context was missing from the data: we had the ProductID, an abstract unique ID assigned to each movie by Amazon itself, but important movie information such as title and genre was missing.
Solution
Parsing the Data in R-Friendly Format:
To solve our first problem, we wrote a parser to transform the raw record-format data into a CSV, which is easier to work with in the R environment. The parser was written in R itself, using the basic approach of loading the text file into a data frame. It reads the 8 lines of each record and transposes them into columns, eventually transposing all the data from rows to columns and writing it out as a CSV file.
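The project's parser was written in R; purely as an illustration, here is a minimal Python sketch of the same row-to-column transposition, assuming each review is an 8-line block of `key: value` pairs separated by a blank line (the field names follow the SNAP file layout):

```python
import csv
import io

FIELDS = ["product/productId", "review/userId", "review/profileName",
          "review/helpfulness", "review/score", "review/time",
          "review/summary", "review/text"]

def parse_records(lines):
    """Yield one dict per review block, transposing rows into columns."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:                     # a blank line ends the current record
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:                           # flush the last record
        yield record

def to_csv(lines):
    """Write the parsed records to a CSV string, one review per row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for rec in parse_records(lines):
        writer.writerow({k: rec.get(k, "") for k in FIELDS})
    return out.getvalue()
```

Splitting on blank lines rather than counting exactly 8 lines makes the sketch tolerant of records with a missing field, which the strict 8-line approach described above would misalign.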
The following is the output file we got from the parser:
Getting additional supporting data:
The data we had initially did not give us information about the exact movie name and genre needed to perform the PMI calculations. We had the unique movie ID in the set, which identified each movie, so we pulled the supporting information about each movie from the Amazon Product Advertising API by supplying that unique Product_ID.
The code block below is a snippet of the function that calls the itemLookup operation of the AWS Product Advertising API; highlighted are the data elements we obtain in the end. The same data is also exported to the Excel sheet in the later steps.
To retrieve this information, a NodeJS middleware was developed to gather more information about the movie using Amazon Web Services and the Product ID, as shown in Figure (3) above.
As shown in Figure (4), after parsing and gathering more data using the Amazon Web Service, we get two files: one with the movie reviews in parsed CSV format, and another with the movie details, containing new information such as title, genre, audience rating, release date, running time and director.
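The project's middleware was written in NodeJS; as a rough Python sketch only, the (now-deprecated) Product Advertising API ItemLookup call was a signed REST request along the following lines. The endpoint, parameter names and HMAC-SHA256 signing scheme are recalled from the legacy API and should be checked against the official documentation before use:

```python
import base64
import hashlib
import hmac
import urllib.parse
from datetime import datetime, timezone

def build_item_lookup_url(product_id, access_key, associate_tag, secret_key):
    """Build a signed ItemLookup URL for the legacy Product Advertising API."""
    host = "webservices.amazon.com"
    path = "/onca/xml"
    params = {
        "Service": "AWSECommerceService",
        "Operation": "ItemLookup",
        "ItemId": product_id,
        "ResponseGroup": "ItemAttributes",
        "AWSAccessKeyId": access_key,
        "AssociateTag": associate_tag,
        "Timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Canonical query string: parameters sorted by name, percent-encoded.
    query = "&".join(f"{k}={urllib.parse.quote(str(v), safe='-_.~')}"
                     for k, v in sorted(params.items()))
    # Sign GET\nhost\npath\nquery with HMAC-SHA256 over the secret key.
    to_sign = f"GET\n{host}\n{path}\n{query}"
    digest = hmac.new(secret_key.encode(), to_sign.encode(), hashlib.sha256).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="-_.~")
    return f"https://{host}{path}?{query}&Signature={signature}"
```

Fetching the resulting URL returns an XML response whose ItemAttributes group carries the title, genre and related fields that the project exported.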
Model Selection
The purpose of the project is to help users make an informed decision, based not only on the aggregate of the self-annotated data but also on the semantic orientation and polarity of each review individually.
However, reviews alone could be misleading; therefore, we also calculated the Point-wise Mutual Information score for each movie separately. The final score of a movie was then calculated as the aggregate of the user-annotated data, the sentiment scores of its reviews, and its PMI score.
To get the final result, the extracted data was passed through the following steps:
Getting a basic sentiment score for each review.
Package used: syuzhet
Description:
The package comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed by the NLP group at Stanford. It provides four scoring methods: bing, afinn, nrc and stanford. The afinn method was used for our project.
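syuzhet is an R package; purely as an illustration of AFINN-style scoring, the following Python sketch sums valence values from a tiny hand-made lexicon (the real AFINN list assigns each of roughly 2,500 words an integer from -5 to +5):

```python
import re

# Toy lexicon for illustration only; the real AFINN list is much larger.
TOY_AFINN = {"good": 3, "great": 3, "excellent": 5, "love": 3,
             "bad": -3, "terrible": -4, "boring": -2, "hate": -3}

def afinn_style_score(review):
    """Sum the valence of every lexicon word found in the review text."""
    words = re.findall(r"[a-z']+", review.lower())
    return sum(TOY_AFINN.get(w, 0) for w in words)
```

Note that this word-counting approach ignores negation ("not boring" still scores -2), which is one reason the project combines it with the self-annotated scores and PMI rather than using it alone.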
Creating a word cloud for every movie.
Packages used: wordcloud, tm, SnowballC, RColorBrewer
Description:
We combined all the reviews into one variable, calculated term frequencies, and generated the word-cloud images. Before generating the word cloud we also removed the stop words from the reviews. The generated word cloud is multi-colored, with each color describing a certain term-frequency range.
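The project used the R wordcloud/tm stack; as a minimal Python illustration of the same preprocessing step, here is term-frequency counting after stop-word removal (the stop-word list is a toy subset):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "was", "this"}

def term_frequencies(reviews):
    """Count content-word frequencies across all reviews of one movie."""
    counts = Counter()
    for review in reviews:
        for word in re.findall(r"[a-z']+", review.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

The most frequent terms (e.g. `counts.most_common(50)`) are exactly what the word-cloud layout sizes, and what the color bands described above encode.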
Determining the Point-wise Mutual Information (PMI) sentiment score for each movie.
Package used: RCurl
Description:
RCurl provides functions to compose general HTTP requests, convenient functions to fetch URIs, get and post forms, and to process the results returned by the web server. The PMI code was written from scratch as per the project requirement. Web scraping was used to determine the PMI scores for Movie_Title and Movie_Genre, and the ratio Movie_Title/Movie_Genre was used for the final score.
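The report does not show the exact formula; a common search-hit formulation, following Turney's SO-PMI approach, estimates PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with probabilities approximated from search-engine hit counts. A hedged sketch, with the hit counts passed in so no scraping is needed here:

```python
import math

def pmi(hits_xy, hits_x, hits_y, total_pages):
    """Point-wise mutual information estimated from search-engine hit counts.

    Probabilities are approximated as hit counts divided by the total
    number of indexed pages; PMI > 0 means x and y co-occur more often
    than chance, PMI < 0 less often.
    """
    p_xy = hits_xy / total_pages
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    return math.log2(p_xy / (p_x * p_y))
```

With toy counts, pmi(100, 1000, 1000, 1000000) = log2(100 * 10^6 / 10^6) = log2(100), about 6.64.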
Aggregating all the sentiment scores.
Description:
• Took the median of all the users' review scores.
• Took the median of all the users' review-text sentiment scores.
Assigning an overall sentiment score to each movie.
Description:
For this, the median of three parameters was taken and a final score was generated for each movie. Parameters considered:
• The aggregated self-annotated score.
• The aggregated sentiment score of the reviews.
• The calculated semantic orientation score of the movie, i.e. the movie-title PMI relative to the genre PMI.
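The two aggregation steps above can be sketched as follows (function and argument names are illustrative, not from the project's code):

```python
from statistics import median

def movie_overall_score(review_scores, review_sentiments, pmi_score):
    """Median of (aggregated user score, aggregated sentiment score, PMI score)."""
    agg_user = median(review_scores)           # self-annotated star ratings
    agg_sentiment = median(review_sentiments)  # per-review sentiment scores
    return median([agg_user, agg_sentiment, pmi_score])
```

Using the median at both levels makes the final score robust to a handful of extreme reviews, which matters given the extreme-average report imbalance discussed in the introduction.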
Result Overview
In conclusion, we managed to come up with an aggregated score for each movie based on our model. These aggregated scores describe a movie much better than the individual provided scores, because they take every perspective into account: the users' view as well as how the movie performs as a member of its genre.
We also generated a word cloud, which is a better representation of the most common words mentioned by users for a given movie. These word clouds can be displayed alongside the aggregate ratings. The power of a word cloud is that it can show users the topics a movie relates to, which is another deciding factor in choosing what to watch; for example, in the drama genre, topics such as "family politics" or "rape".
The following are snapshots of the result files.
The reviews associated with one movie, together with all its user sentiment scores and its PMI score, are processed to give an output as follows:
The processed data gives the overall rating based on the user score, the PMI score and the user sentiment score.
Word Cloud: a few of the word clouds generated for particular movies.
Value Obtained
As mentioned in our problem statement, we achieved our goal of providing an apt and correct solution to the genuine problems people face when going through reviews. Our results not only provided review scores based on sentiment analysis and the PMI function, but also visualized word clouds for each and every movie.
Achievements
Through this project, we dove deep into the concept of sentiment analysis and realized the importance and role of sentiment analysis in everyday life. We were able to perform the basic sentiment analysis and the PMI function on our dataset without many complications. Adding complexity not only sharpened our understanding of sentiment analysis but also helped us become familiar with the R language. We also learned about the different packages available out of the box, and how to use them to achieve our results.
Scope for Improvement
Identification of accurate review analysis through Plot Trajectory:
This is the most important future scope of our project, wherein we could extract accurate feedback from both positive and negative reviews. A plot trajectory converts each review into a graph of sentiment over the course of the text, which helps analysts understand and summarize reviews. It would also help analysts find negative feedback inside positive reviews, and positive feedback inside negative reviews, so that they can pinpoint the exact problems a product has.
For example, a consumer who owns a Dell laptop might write:
"Dell Laptops are excellent to use and they are the most durable, however, if Dell could figure out the solution to the problem of heating in their laptops, then they would be even better."
Under normal circumstances this would be considered a positive review; however, it has one negative part. With a plot trajectory, the minimum point can be taken as feedback for the product managers to work on, which can give excellent results.
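As an illustration of the idea (not code from the project), a clause-level trajectory can be computed with any per-clause sentiment scorer, and its minimum located; the toy scorer below just counts hand-picked positive and negative words:

```python
import re

POSITIVE = {"excellent", "durable", "better", "great"}
NEGATIVE = {"problem", "heating", "broken", "bad"}

def clause_score(clause):
    """Toy per-clause sentiment: positive word hits minus negative word hits."""
    words = set(re.findall(r"[a-z]+", clause.lower()))
    return len(words & POSITIVE) - len(words & NEGATIVE)

def plot_trajectory(review):
    """Return the per-clause sentiment trajectory and the index of its minimum."""
    clauses = [c for c in re.split(r"[.,;]", review) if c.strip()]
    scores = [clause_score(c) for c in clauses]
    return scores, scores.index(min(scores))
```

On the Dell example above, the minimum of the trajectory falls on the clause about heating, which is exactly the actionable feedback a product manager would want extracted from an otherwise positive review.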
Word Clouds based on a certain Part-of-Speech
Another future scope of our project is to focus the word clouds primarily on a given part of speech, for example adjectives or adverbs.
The word cloud would be filtered to adjectives after performing POS tagging. The current word clouds contain many other parts of speech, which might not lead to accurate management decisions; when targeted at the right adjectives, they would help product managers focus on the key areas in order to market the product. For example, say you have a cloud of 150 words for a particular product: if only the adjectives are targeted, it can hit the bullseye.
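A sketch of the filtering step, using a tiny hand-made tag dictionary in place of a real POS tagger (in practice one would use a tagger such as those in openNLP for R or NLTK for Python):

```python
from collections import Counter

# Toy POS dictionary standing in for a real tagger's output.
TOY_TAGS = {"excellent": "ADJ", "durable": "ADJ", "boring": "ADJ",
            "movie": "NOUN", "watch": "VERB", "slowly": "ADV"}

def adjective_frequencies(words):
    """Keep only words tagged as adjectives, counted for the word cloud."""
    return Counter(w for w in words if TOY_TAGS.get(w) == "ADJ")
```

The resulting counts would feed the same word-cloud generation step as before, now restricted to one part of speech.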
Other
The reviews and sentiment scores are limited to Amazon movie reviews only. We could perform the same sentiment analysis on other movie-review websites such as IMDb and compare the results. We could also run the sentiment analysis on other categories (e.g., director) and surface user sentiment for those categories. Finally, performance optimization could produce a more accurate user sentiment score for each movie by including more reviews in the dataset (currently we use only 400 records).
Citation
Dataset: http://snap.stanford.edu/data/web-Movies.html
WordCloud: http://www.r-bloggers.com/building-wordclouds-in-r/
Lectures for topic understanding
Google for general searches throughout