machine learning from movie reviews - long form

Machine Learning From Movie Reviews

Jennifer Dunne

Why Movie Reviews?

• Natural Language Processing is hot

• There are real world use cases

• It plays to my domain knowledge outside of data science

Natural Language Processing Use Cases

What Are People Saying About Me?

Who Are These People?

Sentiment Analysis

What Are People Saying About Me?

Good, Bad

Customer Segmentation

Who Are These People?

Animated, Action, Comedy, Drama, Family, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Sentiment Analysis Testing

• 100,000 movie reviews from IMDB

• Training set of 12,500 positive reviews (7-10 stars)

• Training set of 12,500 negative reviews (<5 stars)

• 30 or fewer reviews per movie

Methods

• Bag of words (Sklearn TF/IDF)

• Word2Vec (Gensim)

• Doc2Vec (Gensim)

• Pattern

• Indico (by sentence)

• Indico (by document)

• Indico (High Quality Sentiment)

Speed

• How long does it take to train?

◦ Preparing text

◦ Training machine learning model

◦ Some models are pre-trained

• How quickly does it analyze?

◦ Preparing text

◦ Running text through the trained model
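The speed questions above can be answered by timing each stage separately. The following is a minimal sketch, not the code used for the talk: `timed`, `prepare_text`, `train_model`, and `apply_model` are hypothetical names, and the "model" is a trivial placeholder standing in for a real classifier.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-ins for the real pipeline stages.
def prepare_text(reviews):
    # Lowercase and split on whitespace.
    return [review.lower().split() for review in reviews]

def train_model(token_lists):
    # Placeholder "model": the set of words seen during training.
    return {tok for tokens in token_lists for tok in tokens}

def apply_model(model, token_lists):
    # Placeholder scoring: fraction of known words in each review.
    return [sum(t in model for t in tokens) / max(len(tokens), 1)
            for tokens in token_lists]

reviews = ["A great movie", "A terrible movie"]
tokens, prep_s = timed(prepare_text, reviews)
model, train_s = timed(train_model, tokens)
scores, apply_s = timed(apply_model, model, tokens)
print(f"prepare: {prep_s:.6f}s  train: {train_s:.6f}s  apply: {apply_s:.6f}s")
```

Timing text preparation apart from training and from applying the model shows which stage dominates. For a pre-trained model, only the preparation and apply timings are relevant.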

How well does it do?

• Accuracy

◦ Did we correctly name positive sentiment as positive?

◦ Did we correctly name negative sentiment as negative?

◦ Better for even class distribution

• F1 = 2 × (Precision × Recall) / (Precision + Recall)

◦ Precision = percent of things we called positive that were actually positive

◦ Recall = percent of things that were actually positive that we called positive

◦ Better for uneven class distribution
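As a worked example of these metrics, here is a small sketch in plain Python. It uses the conventional definition of F1 as the harmonic mean of precision and recall. The function names and the toy labels are illustrative assumptions, not taken from the talk.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = classification_scores(y_true, y_pred)
print(scores)
```

In this toy case precision is 1.0 and recall is 0.5, so F1 is about 0.667, which is lower than the simple average of the two (0.75). The harmonic mean penalizes a large gap between precision and recall.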

Bag Of Words (Sklearn TFIDF)

• Simple algorithm

• Fast to train (10 minutes)

• Fast to apply

• 85.3% accuracy

• 85.3% F1
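To illustrate what the bag-of-words features look like, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the code behind the results above: the function name `tfidf_vectors` and the whitespace tokenizer are assumptions, and the smoothed IDF with L2 normalization is intended to mirror the default behavior of sklearn's TfidfVectorizer. The classifier trained on these features is not specified in the slides and is not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = sorted(df)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors

docs = ["great great movie", "terrible movie"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Word order is discarded, which is why the approach is simple and fast. Words that appear in every review, such as "movie" here, receive less weight than words that distinguish one review from another.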

Word2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (2 hours)

• Slow to apply

• 81.9% accuracy

• 82.2% F1

Doc2Vec (Gensim)

• More complex algorithm

• Computationally intensive

• Better results with larger training sets, multiple epochs

• Slow to train (4 hours)

• Slow to apply

• 82.8% accuracy

• 82.8% F1

• Scores shown are for the distributed bag of words model

• (other Doc2Vec models reached 70% and 82% accuracy)

Pattern (built-in)

• Simple algorithm

• Part of the Pattern module

• No training required

• Fast to apply

• 76.4% accuracy

• 76.9% F1

• (lowest scores)

Indico (by sentence)

• More complex algorithm

• API calls to proprietary system

• Fast to apply

• 89.1% accuracy

• 88.9% F1

Indico (by document)

• Simple algorithm

• Fast to apply

• 90.1% accuracy

• 90.0% F1

Indico (High Quality Sentiment)

• Simple algorithm

• Slow to apply

• 93.2% accuracy

• 93.2% F1

• (highest scores)

[Figure: Comparison of Sentiment Prediction. Bar chart of Accuracy and F1 on a 0 to 1 scale for Indico Sent_HQ, Indico by doc., Doc2Vec, Pattern, Indico by sent., Word2Vec, and Bag of Words.]

Customer Segmentation

Who Are These People?

Animated, Action, Comedy, Drama, War, Western

Customer Segmentation Testing

• 129,809 movie reviews from IMDB

• 3,323 different movies

• 510 different combinations of genre

• 40 or fewer reviews per movie

Methodology

◦ Transform each review into an Indico document vector

◦ Test success of different document criteria

◦ Test success of different models

◦ Optimize best model

What Is Random Chance?

• 510 different genre combinations

• Heavily weighted to negative

• F1 random chance < 20%

Animated, Action, Comedy, Drama, War, Western
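One way to see why the labels are heavily weighted to negative is to encode each movie's genres as one yes/no flag per genre. The sketch below assumes that encoding, which the slides do not spell out. The genre list comes from the earlier slide, while the movie titles and their genres are made up for illustration.

```python
GENRES = ["Animated", "Action", "Comedy", "Drama", "Family", "Fantasy",
          "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller",
          "War", "Western"]

def encode_genres(movie_genres):
    """Return one 0/1 flag per genre, in the order of GENRES."""
    present = set(movie_genres)
    return [1 if genre in present else 0 for genre in GENRES]

# Hypothetical movies, not taken from the dataset.
movies = {
    "Movie A": ["Action", "Sci-Fi", "Thriller"],
    "Movie B": ["Comedy", "Romance"],
    "Movie C": ["Animated", "Family", "Comedy"],
}

label_rows = [encode_genres(genres) for genres in movies.values()]
total_flags = len(label_rows) * len(GENRES)
positive_flags = sum(sum(row) for row in label_rows)
negative_share = 1 - positive_flags / total_flags
combinations = {tuple(row) for row in label_rows}

print(f"{len(combinations)} distinct genre combinations")
print(f"{negative_share:.0%} of genre flags are negative")
```

Each movie carries only a few of the 14 genres, so most flags are 0. A classifier that guesses at random, or that always answers "no", scores poorly on F1 for the positive flags, which is why F1 rather than accuracy is the useful baseline here.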

More Reviews or More Words?

Number of Reviews    Word Length    F1 Score
129,809              All Reviews    .54
72,691               200+           .56
19,914               500+           .57

Conclusion: Longer reviews work better
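The review subsets compared above can be built with a simple length filter. This is a sketch under the assumption that "Word Length" means the number of whitespace-separated words in a review. The function name and the toy reviews are illustrative only.

```python
def filter_by_word_count(reviews, min_words):
    """Keep only reviews with at least min_words whitespace-separated words."""
    return [review for review in reviews if len(review.split()) >= min_words]

# Toy reviews of 50, 250, and 600 words.
reviews = ["good " * 50, "fine " * 250, "long " * 600]

for threshold in (0, 200, 500):
    subset = filter_by_word_count(reviews, threshold)
    print(f"{threshold}+ words: {len(subset)} reviews")
```

Raising the threshold shrinks the training set, so the comparison trades the number of reviews against the amount of text per review.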

Optimizing Models

Model                                                  F1 Score
Tuned Random Forest                                    .57
Initial Logistic Regression, Initial Linear SVC        .62
Tuned Logistic Regression, Tuned Linear SVC            .63
Initial Gradient Boost                                 .63
Tuned Gradient Boost                                   .67

Conclusion: Choosing the right model matters more than tuning
