Sentiment Analysis May 9, 2016
IS 688 : Web Mining – Spring 2016
Sentiment Analysis: Amazon Movie Reviews Dataset
Project Report – Group 1
Professor: Christopher Markson
Team: Amit | Maham | Mashael | Karan | Nidhish
5-9-2016
Table of Contents
Acknowledgement
Abstract
Problem Statement
Introduction
Data Collection
    Dataset Source and Format
    Problem
    Solution
        Parsing the Data in R-Friendly Format
        Getting Additional Supporting Data
Model Selection
    Getting a basic sentiment score for each review
    Creating a word cloud for every movie
    Determining the Point-wise Mutual Information (PMI) sentiment score for each movie
    Aggregating all the sentiment scores
    Assigning an overall sentiment score to each movie
Result Overview
    Value Obtained
    Achievements
Scope for Improvement
    Identification of accurate review analysis through Plot Trajectory
    Word Clouds based on a certain Part-of-Speech
    Other
Citation
Acknowledgement
We would like to express our gratitude and appreciation to Professor Christopher Markson, who not only gave us the responsibility of completing this report, but also helped us in completing it. A special thanks to him for pointing us to R-Bloggers, which helped us complete this project report successfully.
We would also like to acknowledge the R packages, R-Bloggers posts and all the other online help we received in reaching our objective: sentiment analysis and the PMI function.
Abstract
Real-time sentiment analysis is a challenging machine learning task, due to the scarcity of labeled data and the sudden changes in sentiment caused by real-world events, which must be interpreted instantly. In this project we propose solutions that save users the time they would otherwise spend reading every review of a product, and help them make a better-informed decision instantly. We also strove to acquire labels and cope with concept drift in this setting by using findings from social psychology on how humans prefer to disclose certain types of emotions. In particular, we use the findings that humans are more motivated to report positive feelings than negative feelings, and prefer to report extreme feelings rather than average feelings.
The report mainly covers gathering and parsing the data, collecting additional information about each movie, and the sentiment analysis performed on the Amazon movie reviews. The full dataset contains around 8 million reviews, spanning a period of more than 10 years, from August 1997 to October 2012.
We show that our sentiment analysis produces useful accuracy when analyzing reactions in the Amazon movie reviews, while requiring little human effort in generating supervisory labels.
Problem Statement
Users have written hundreds of reviews for each movie. The reviews are expressed in natural language, along with a self-annotated score describing the overall sentiment of the review. To make a well-informed decision, a user would have to go through each of them, a time-consuming activity that users are highly unlikely to undertake.
In this project we help users make a better-informed decision, not only from the aggregate of the self-annotated data but also by calculating the semantic orientation and polarity of each review individually.
However, reviews alone could be misleading; therefore, we also calculated a Point-wise Mutual Information (PMI) score for each movie separately. The final score of a movie was then calculated as the aggregate of the user-annotated data, the sentiment scores of its reviews, and its PMI score.
Introduction
As social media platforms become the primary medium people use to express their opinions and feelings about the multitude of topics that pop up daily in the news, the vast amount of opinionated data now available in the form of social streams gives us an unprecedented opportunity to build valuable applications that monitor public opinions and opinion shifts. For example, a platform can track human sentiment about a movie, something far more appealing than the relative number of mentions, which is what most movie websites currently offer. Creating such applications enriches the personal experience of watching a movie, where not only the movie itself, but how others react to it, is part of the experience.
The task of interpreting positive and negative feelings expressed on social streams has a number of unique characteristics: the human cost of generating a constant flow of labeled messages on streams remains high, and the distribution of positive and negative opinions is potentially quite different from the random samples obtained in traditional opinion polls and survey methodologies.
We built sentiment analysis models that exploit two factors widely described by substantive
research from social psychology and behavioral economics that describe human preferences
when disclosing emotion publicly:
Positive-negative sentiment report imbalance: people tend to express positive feelings more than negative feelings in social environments.
Extreme-average sentiment report imbalance: people tend to express extreme feelings more than average feelings in social environments.
We explore each of these two self-report imbalances to accomplish a different subtask in learning-based sentiment analysis.
The first self-report factor, which we call positive-negative sentiment report imbalance throughout the paper, is employed to acquire labeled data that supports supervised classifiers. In the context of polarizing groups (a division of the population into groups of people sharing similar opinions), a positive event for one group tends to be negative for the other, and vice versa. We predict the current dominant sentiment by simply counting how many members of each group, relative to group size, decided to post a message during a specified time frame. We adopt a probabilistic model that computes the uncertainty of the social context and, at each time frame, generates a probabilistic sentiment label, which can then be incorporated into a range of content-based supervised classifiers.
The second self-report factor we explore is related to the human tendency to report extreme experiences more than average experiences. The extreme-average sentiment report imbalance has an important consequence for real-time sentiment tracking: because extreme feelings stimulate reactions, spikes of activity in streams of opinionated text tend to contain highly emotional terms, which are precisely the features that are helpful for sentiment prediction.
Our experimental studies demonstrate that these dynamic features are better indicators of emerging and strong feelings than traditional static representations (e.g., TF-IDF), allowing the underlying classification model to adapt more quickly to sudden sentiment drift induced by real-world events. As a result, our framework can be incorporated into sophisticated sentiment classifiers that make use of more powerful features.
Data Collection
Dataset Source and Format:
Unlike the majority of research on supervised sentiment analysis, which focuses on batch processing of opinionated documents, here we are interested in the setting where the data arrives as a stream of Amazon movie reviews.
The dataset was downloaded from http://snap.stanford.edu/data/web-Movies.html; the downloaded text file was a 3 GB zipped file, approximately 9 GB when unzipped, containing more than 8 million reviews. The dataset was published by the Stanford University researchers J. McAuley and J. Leskovec. During web scraping for the PMI calculation, Google blocked our requests, so we scaled the data down to 400 reviews from 14 movies to perform the sentiment analysis.
The data in the original file was provided in the following format; the details of each field are listed below:
Product Id: a unique ID generated by Amazon and assigned to each movie.
User Id: the ID of the user.
Profile Name: the name of the user who wrote the review.
Helpfulness: the number of users who found the review useful, out of the total who rated it.
Score: the rating the user gave the product.
Time: the time of the review.
Summary: a short summary of the review.
Text: the full comments and review written by the user about the movie.
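For reference, each review in the downloaded file is a multi-line block of key-value pairs separated by blank lines, roughly as below (the field names follow the SNAP web-Movies layout; the values are illustrative):

```
product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: John Doe
review/helpfulness: 7/9
review/score: 5.0
review/time: 1042502400
review/summary: Great classic
review/text: One of the best movies I have seen ...
```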
Problem
Though the data provided was quite rich, we encountered a few issues while dealing with it in R. The following is a summary of the problems.
1. The downloaded data format was not R-friendly:
The raw file downloaded from the website stored each record across multiple rows, and we had to transpose it into column format so that R could read it.
The data also contained various breaks, invalid spaces and symbols, which needed to be removed before any meaningful sentiment analysis could be done.
2. Context was missing from the data: we had the ProductID, an abstract unique ID assigned to each movie by Amazon itself, but important movie information such as title and genre was missing.
Solution
Parsing the Data in R-Friendly Format:
To solve our first problem, we wrote a parser to transform the raw record-format data into a CSV, which is easier to work with in the R environment. The parser was written in R itself, using the basic approach of loading the text file into a data frame. It reads the 8 lines of each record and transposes them into columns, eventually transposing all the data from rows to columns and writing it out as a CSV file.
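The project's parser was written in R; purely as an illustration, here is a minimal Python sketch of the same row-to-column transposition, assuming each review is an 8-line block of `key: value` pairs separated by a blank line (the field names follow the SNAP file layout):

```python
import csv
import io

FIELDS = ["product/productId", "review/userId", "review/profileName",
          "review/helpfulness", "review/score", "review/time",
          "review/summary", "review/text"]

def parse_records(lines):
    """Yield one dict per review block, transposing rows into columns."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:                     # a blank line ends the current record
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:                           # flush the last record
        yield record

def to_csv(lines):
    """Write the parsed records to a CSV string, one review per row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for rec in parse_records(lines):
        writer.writerow({k: rec.get(k, "") for k in FIELDS})
    return out.getvalue()
```

Splitting on blank lines rather than counting exactly 8 lines makes the sketch tolerant of records with a missing field, which the strict 8-line approach described above would misalign.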
The following is the output file we got from the parser:
Getting additional supporting data:
The data we had initially did not give us information about the exact movie name and genre needed to perform the PMI calculations. We had the unique movie ID in the set, which identified each movie, so we pulled the supporting information about each movie from the Amazon Product Advertising API by supplying that unique Product_ID.
The code block below is a snippet of the function that calls the itemLookup operation of the AWS Product Advertising API; highlighted are the data elements we obtain in the end. The same data is also exported to the Excel sheet in the later steps.
To retrieve this information, a NodeJS middleware was developed to gather more information about the movie using Amazon Web Services and the Product ID, as shown in Figure (3) above.
As shown in Figure (4), after parsing and gathering more data using the Amazon Web Service, we get two files: one with the movie reviews in parsed CSV format, and another with the movie details, containing new information such as title, genre, audience rating, release date, running time and director.
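The project's middleware was written in NodeJS; as a rough Python sketch only, the (now-deprecated) Product Advertising API ItemLookup call was a signed REST request along the following lines. The endpoint, parameter names and HMAC-SHA256 signing scheme are recalled from the legacy API and should be checked against the official documentation before use:

```python
import base64
import hashlib
import hmac
import urllib.parse
from datetime import datetime, timezone

def build_item_lookup_url(product_id, access_key, associate_tag, secret_key):
    """Build a signed ItemLookup URL for the legacy Product Advertising API."""
    host = "webservices.amazon.com"
    path = "/onca/xml"
    params = {
        "Service": "AWSECommerceService",
        "Operation": "ItemLookup",
        "ItemId": product_id,
        "ResponseGroup": "ItemAttributes",
        "AWSAccessKeyId": access_key,
        "AssociateTag": associate_tag,
        "Timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Canonical query string: parameters sorted by name, percent-encoded.
    query = "&".join(f"{k}={urllib.parse.quote(str(v), safe='-_.~')}"
                     for k, v in sorted(params.items()))
    # Sign GET\nhost\npath\nquery with HMAC-SHA256 over the secret key.
    to_sign = f"GET\n{host}\n{path}\n{query}"
    digest = hmac.new(secret_key.encode(), to_sign.encode(), hashlib.sha256).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="-_.~")
    return f"https://{host}{path}?{query}&Signature={signature}"
```

Fetching the resulting URL returns an XML response whose ItemAttributes group carries the title, genre and related fields that the project exported.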
Model Selection
The purpose of the project is to help users make an informed decision, based not only on the aggregate of the self-annotated data but also on the semantic orientation and polarity of each review individually.
However, reviews alone could be misleading; therefore, we also calculated the Point-wise Mutual Information score for each movie separately. The final score of a movie was then calculated as the aggregate of the user-annotated data, the sentiment scores of its reviews, and its PMI score.
To get the final result, the extracted data was passed through the following steps:
Getting a basic sentiment score for each review.
Package used: syuzhet
Description:
The package comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed by the NLP group at Stanford. It provides four scoring methods: bing, afinn, nrc and stanford. The afinn method was used for our project.
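syuzhet is an R package; purely as an illustration of AFINN-style scoring, the following Python sketch sums valence values from a tiny hand-made lexicon (the real AFINN list assigns each of roughly 2,500 words an integer from -5 to +5):

```python
import re

# Toy lexicon for illustration only; the real AFINN list is much larger.
TOY_AFINN = {"good": 3, "great": 3, "excellent": 5, "love": 3,
             "bad": -3, "terrible": -4, "boring": -2, "hate": -3}

def afinn_style_score(review):
    """Sum the valence of every lexicon word found in the review text."""
    words = re.findall(r"[a-z']+", review.lower())
    return sum(TOY_AFINN.get(w, 0) for w in words)
```

Note that this word-counting approach ignores negation ("not boring" still scores -2), which is one reason the project combines it with the self-annotated scores and PMI rather than using it alone.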
Creating a word cloud for every movie.
Packages used: wordcloud, tm, SnowballC, RColorBrewer
Description:
We combined all the reviews into one variable, calculated term frequencies, and generated the word-cloud images. Before generating the word cloud we also removed the stop words from the reviews. The generated word cloud is multi-colored, with each color describing a certain term-frequency range.
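The project used the R wordcloud/tm stack; as a minimal Python illustration of the same preprocessing step, here is term-frequency counting after stop-word removal (the stop-word list is a toy subset):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "was", "this"}

def term_frequencies(reviews):
    """Count content-word frequencies across all reviews of one movie."""
    counts = Counter()
    for review in reviews:
        for word in re.findall(r"[a-z']+", review.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

The most frequent terms (e.g. `counts.most_common(50)`) are exactly what the word-cloud layout sizes, and what the color bands described above encode.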
Determining the Point-wise Mutual Information (PMI) sentiment score for each movie.
Package used: RCurl
Description:
RCurl provides functions to compose general HTTP requests, convenient functions to fetch URIs, get and post forms, and to process the results returned by the web server. The PMI code was written from scratch as per the project requirement. Web scraping was used to determine the PMI scores for Movie_Title and Movie_Genre, and the ratio Movie_Title/Movie_Genre was used for the final score.
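The report does not show the exact formula; a common search-hit formulation, following Turney's SO-PMI approach, estimates PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with probabilities approximated from search-engine hit counts. A hedged sketch, with the hit counts passed in so no scraping is needed here:

```python
import math

def pmi(hits_xy, hits_x, hits_y, total_pages):
    """Point-wise mutual information estimated from search-engine hit counts.

    Probabilities are approximated as hit counts divided by the total
    number of indexed pages; PMI > 0 means x and y co-occur more often
    than chance, PMI < 0 less often.
    """
    p_xy = hits_xy / total_pages
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    return math.log2(p_xy / (p_x * p_y))
```

With toy counts, pmi(100, 1000, 1000, 1000000) = log2(100 * 10^6 / 10^6) = log2(100), about 6.64.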
Aggregating all the sentiment scores.
Description:
• Took the median of all the users' review scores.
• Took the median of all the users' review-text sentiment scores.
Assigning an overall sentiment score to each movie.
Description:
For this, the median of three parameters was taken and a final score was generated for each movie. Parameters considered:
• The aggregated self-annotated score.
• The aggregated sentiment score of the reviews.
• The calculated semantic orientation score of the movie, i.e. the movie-title PMI relative to the genre PMI.
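The two aggregation steps above can be sketched as follows (function and argument names are illustrative, not from the project's code):

```python
from statistics import median

def movie_overall_score(review_scores, review_sentiments, pmi_score):
    """Median of (aggregated user score, aggregated sentiment score, PMI score)."""
    agg_user = median(review_scores)           # self-annotated star ratings
    agg_sentiment = median(review_sentiments)  # per-review sentiment scores
    return median([agg_user, agg_sentiment, pmi_score])
```

Using the median at both levels makes the final score robust to a handful of extreme reviews, which matters given the extreme-average report imbalance discussed in the introduction.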
Result Overview
In conclusion, we managed to come up with an aggregated score for each movie based on our model. These aggregated scores describe a movie much better than the individual provided scores, because they take every perspective into account: the users' view as well as how the movie performs as a member of its genre.
We also generated a word cloud, which is a better representation of the most common words mentioned by users for a given movie. These word clouds can be displayed alongside the aggregate ratings. The power of a word cloud is that it can show users the topics a movie relates to, which is another deciding factor in choosing what to watch; for example, in the drama genre, topics such as "family politics" or "rape".
The following are snapshots of the result files.
The reviews associated with one movie, together with all its user sentiment scores and its PMI score, are processed to give an output as follows:
The processed data gives the overall rating based on the user score, the PMI score and the user sentiment score.
Word Cloud: a few of the word clouds generated for particular movies.
Value Obtained
As mentioned in our problem statement, we achieved our goal of providing an apt and correct solution to the genuine problems people face when going through reviews. Our results not only provided review scores based on sentiment analysis and the PMI function, but also visualized word clouds for each and every movie.
Achievements
Through this project, we dove deep into the concept of sentiment analysis and realized the importance and role of sentiment analysis in everyday life. We were able to perform the basic sentiment analysis and the PMI function on our dataset without many complications. Adding complexity not only sharpened our understanding of sentiment analysis but also helped us become familiar with the R language. We also learned about the different packages available out of the box, and how to use them to achieve our results.
Scope for Improvement
Identification of accurate review analysis through Plot Trajectory:
This is the most important future scope of our project, wherein we could extract accurate feedback from both positive and negative reviews. A plot trajectory converts each review into a graph of sentiment over the course of the text, which helps analysts understand and summarize reviews. It would also help analysts find negative feedback inside positive reviews, and positive feedback inside negative reviews, so that they can pinpoint the exact problems a product has.
For example, a consumer who owns a Dell laptop might write:
"Dell Laptops are excellent to use and they are the most durable, however, if Dell could figure out the solution to the problem of heating in their laptops, then they would be even better."
Under normal circumstances this would be considered a positive review; however, it has one negative part. With a plot trajectory, the minimum point can be taken as feedback for the product managers to work on, which can give excellent results.
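As an illustration of the idea (not code from the project), a clause-level trajectory can be computed with any per-clause sentiment scorer, and its minimum located; the toy scorer below just counts hand-picked positive and negative words:

```python
import re

POSITIVE = {"excellent", "durable", "better", "great"}
NEGATIVE = {"problem", "heating", "broken", "bad"}

def clause_score(clause):
    """Toy per-clause sentiment: positive word hits minus negative word hits."""
    words = set(re.findall(r"[a-z]+", clause.lower()))
    return len(words & POSITIVE) - len(words & NEGATIVE)

def plot_trajectory(review):
    """Return the per-clause sentiment trajectory and the index of its minimum."""
    clauses = [c for c in re.split(r"[.,;]", review) if c.strip()]
    scores = [clause_score(c) for c in clauses]
    return scores, scores.index(min(scores))
```

On the Dell example above, the minimum of the trajectory falls on the clause about heating, which is exactly the actionable feedback a product manager would want extracted from an otherwise positive review.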
Word Clouds based on a certain Part-of-Speech
Another future scope of our project is to focus the word clouds primarily on a given part of speech, for example adjectives or adverbs.
The word cloud would be filtered to adjectives after performing POS tagging. The current word clouds contain many other parts of speech, which might not lead to accurate management decisions; when targeted at the right adjectives, they would help product managers focus on the key areas in order to market the product. For example, say you have a cloud of 150 words for a particular product: if only the adjectives are targeted, it can hit the bullseye.
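A sketch of the filtering step, using a tiny hand-made tag dictionary in place of a real POS tagger (in practice one would use a tagger such as those in openNLP for R or NLTK for Python):

```python
from collections import Counter

# Toy POS dictionary standing in for a real tagger's output.
TOY_TAGS = {"excellent": "ADJ", "durable": "ADJ", "boring": "ADJ",
            "movie": "NOUN", "watch": "VERB", "slowly": "ADV"}

def adjective_frequencies(words):
    """Keep only words tagged as adjectives, counted for the word cloud."""
    return Counter(w for w in words if TOY_TAGS.get(w) == "ADJ")
```

The resulting counts would feed the same word-cloud generation step as before, now restricted to one part of speech.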
Other
The reviews and sentiment scores are limited to Amazon movie reviews only. We could perform the same sentiment analysis on other movie-review websites such as IMDb and compare the results. We could also run the sentiment analysis on other categories (e.g., director) and surface user sentiment for those categories. Finally, performance optimization could produce a more accurate user sentiment score for each movie by including more reviews in the dataset (currently we use only 400 records).
Citation
Dataset: http://snap.stanford.edu/data/web-Movies.html
WordCloud: http://www.r-bloggers.com/building-wordclouds-in-r/
Lectures for topic understanding
Google for general searches throughout