Applying Word Vectors for Sentiment Analysis
& Text Analysis while Browsing
Abdullah Khan Zehady
Department Of Computer Science,
Purdue University
Movie Review - Sentiment Analysis
● Collected from Kaggle ML Competition.
● Data
  o “Review Index” “Review” “Sentiment (0/1)”
  1. LabeledTrainData
     ● 25000 movie reviews
  2. TestData
     ● 25000 movie reviews
Approach 1: Bag Of Words - Baseline
● Data Preprocessing
  o Removal of HTML, Non-Letters, Stopwords, space + LowerCase conversion
● Creating Features from Bag Of Words
  o 5000 most frequent words (25000 x 5000)
  o { the, cat, sat, on, hat, dog, ate, and } ---> { 2, 1, 1, 1, 1, 0, 0, 0 }
  o { the, cat, sat, on, hat, dog, ate, and } ---> { 3, 1, 0, 0, 1, 1, 1, 1 }
● Supervised Learning
  o Random Forest Classifier with 100 trees
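A minimal sketch of the preprocessing and counting steps above, using only the Python standard library. The two example sentences, the small stopword set, and the helper names `clean_review` and `bag_of_words` are assumptions for illustration; the sentences were chosen so that the counts match the two vectors shown on the slide. Stopword removal is left off in the example because the slide's vocabulary still counts "the". The Random Forest step is not shown.

```python
import re
from collections import Counter

# Illustrative stopword set; a real run would use a full list (assumption).
STOPWORDS = {"a", "an", "the", "and", "on", "is", "it"}

def clean_review(raw, remove_stopwords=False):
    """Strip HTML tags and non-letters, lowercase, and split into words."""
    no_html = re.sub(r"<[^>]+>", " ", raw)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_html)
    words = letters_only.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

def bag_of_words(words, vocabulary):
    """Count how often each vocabulary word occurs in the review."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]
review_1 = clean_review("The cat sat on <b>the</b> hat.")
review_2 = clean_review("The dog ate the cat and the hat!")
vector_1 = bag_of_words(review_1, vocabulary)
vector_2 = bag_of_words(review_2, vocabulary)
print(vector_1)
print(vector_2)
```

In the full setting the vocabulary would be the 5000 most frequent words, giving one 5000-dimensional count vector per review.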
Approach 2: TF-IDF Word Weight
● Review Vector ← TF-IDF word weight
Approach 3: Vector Averaging
● Word2Vec word vectors (Dim = 300)
  o Review Vector ← Element wise Average
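The two review representations above can be sketched as follows. The slides do not state which TF-IDF variant was used, so this assumes raw term count times log(N / document frequency). The toy reviews, the 2-dimensional stand-in vectors, and the function names `tfidf_vectors` and `average_vector` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocabulary = sorted({word for doc in documents for word in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    vectors = []
    for doc in documents:
        term_counts = Counter(doc)
        vectors.append([term_counts[w] * idf[w] for w in vocabulary])
    return vocabulary, vectors

def average_vector(words, word_vectors, dim):
    """Element-wise mean of the vectors of the words found in word_vectors."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

reviews = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocabulary, tfidf = tfidf_vectors(reviews)

# Toy 2-dimensional vectors standing in for 300-dimensional Word2Vec vectors.
toy_vectors = {"good": [1.0, 2.0], "movie": [3.0, 4.0]}
review_vector = average_vector(["good", "movie", "unseen"], toy_vectors, dim=2)
```

Words missing from the vector table are skipped, so a review made up only of unknown words falls back to a zero vector.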
Approach 4: Bag Of Centroids
● K-Means Clustering to find word clusters
● Number of Features = Number of Clusters
● Review Feature Vector
  o Find which cluster each word belongs to and increment that cluster's count.
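A small sketch of the counting step, assuming the K-Means clustering has already been run and each word has a cluster index. The cluster assignments, the sample review, and the function name `bag_of_centroids` are hypothetical.

```python
def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count, per cluster, how many words of the review fall into it."""
    features = [0] * num_clusters
    for word in words:
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # words without a cluster are ignored
            features[cluster] += 1
    return features

# Hypothetical cluster assignments, as K-Means over word vectors might produce.
word_to_cluster = {"great": 0, "excellent": 0, "awful": 1, "boring": 1, "actor": 2}
review = ["great", "actor", "excellent", "plot", "boring"]
features = bag_of_centroids(review, word_to_cluster, num_clusters=3)
```

The resulting vector has one entry per cluster, so its length is fixed regardless of review length.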
Approach 5: Clustering + Pretrained Vector + External Sentiment Dict.
● Pre-trained Data (using word2vec)
  o Entity vectors trained on 100B words from various news articles: freebase-vectors-skipgram1000.bin.gz
  o Pre-trained vectors trained on part of Google News dataset (about 100 billion words)
● Word2Vec “distance”, “most_similar” to look up close words + find review tones
● Incorporating “SentiWordNet” information
  o Positive, Negative Score for each word
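One way per-word positive and negative scores could be turned into review-level features is sketched below. The slides do not say how the scores were combined with the other features, so the summation, the function name `lexicon_features`, and the toy lexicon are assumptions; the scores are made up and are not real SentiWordNet values.

```python
def lexicon_features(words, lexicon):
    """Sum positive and negative scores over the words found in the lexicon."""
    positive = sum(lexicon[w][0] for w in words if w in lexicon)
    negative = sum(lexicon[w][1] for w in words if w in lexicon)
    return positive, negative

# Made-up (positive, negative) scores for illustration only.
toy_lexicon = {"good": (0.75, 0.0), "bad": (0.0, 0.625), "odd": (0.125, 0.25)}
pos, neg = lexicon_features(["good", "odd", "film"], toy_lexicon)
tone = "positive" if pos > neg else "negative"
```

The two sums could be appended to a review's feature vector before classification, or compared directly as a rough tone estimate.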
Result
Method                       Accuracy
Bag Of Words                 0.84
TF-IDF                       0.74
Vector Averaging             0.63
Bag Of Centroids             0.81
PreTrain + Ext. Knowledge    0.87
Page Analysis Chrome Extension
● Important Word List
● Important Named Entities
● Tag Distribution
● Summarization of Text
● Sentiment Analysis
  o Comment Analysis
A tool anyone can use to extract meaningful information from a webpage.
Future Work
● Implementation of RNN, LSTM-RNN, Paragraph Vector
  o Y Bengio, R Ducharme, P Vincent… - The Journal of Machine …, 2003 - dl.acm.org
  o P Le, W Zuidema - COLING, 2012
  o QV Le, T Mikolov, 2014
● Relational inference for wikification
  o Disambiguation to Wikipedia: Pr(title|surface)
  o Candidate title <- Compositional Semantics for candidate wiki page
● Extension: Reranking Google Search result using information visualization.